Abstract

Decision-theoretic planning is a popular approach to sequential decision making prob- 
lems, because it treats uncertainty in sensing and acting in a principled way. In single-agent 
frameworks like MDPs and POMDPs, planning can be carried out by resorting to Q-value 
functions: an optimal Q-value function Q* is computed in a recursive manner by dynamic 
programming, and then an optimal policy is extracted from Q*. In this paper we study
whether similar Q-value functions can be defined for decentralized POMDP models (Dec- 
POMDPs), and how policies can be extracted from such value functions. We define two 
forms of the optimal Q-value function for Dec-POMDPs: one that gives a normative de- 
scription as the Q-value function of an optimal pure joint policy and another one that is 
sequentially rational and thus gives a recipe for computation. This computation, however, 
is infeasible for all but the smallest problems. Therefore, we analyze various approximate 
Q-value functions that allow for efficient computation. We describe how they relate, and 
we prove that they all provide an upper bound to the optimal Q-value function Q*. Finally,
unifying some previous approaches for solving Dec-POMDPs, we describe a family of al- 
gorithms for extracting policies from such Q-value functions, and perform an experimental 
evaluation on existing test problems, including a new firefighting benchmark problem. 



1. Introduction 



One of the main goals in artificial intelligence (AI) is the development of intelligent agents, 
which perceive their environment through sensors and influence the environment through 
their actuators. In this setting, an essential problem is how an agent should decide which 
action to perform in a certain situation. In this work, we focus on planning: constructing 
a plan that specifies which action to take in each situation the agent might encounter over 
time. In particular, we will focus on planning in a cooperative multiagent system (MAS): 
an environment in which multiple agents coexist and interact in order to perform a joint 
task. We will adopt a decision-theoretic approach, which allows us to tackle uncertainty in 
sensing and acting in a principled way. 

Decision-theoretic planning has roots in control theory and in operations research. In 
control theory, one or more controllers control a stochastic system with a specific output
as goal. Operations research considers tasks related to scheduling, logistics and work flow
and tries to optimize the systems concerned. Decision-theoretic planning problems can be
formalized as Markov decision processes (MDPs), which have been frequently employed
in both control theory and operations research, and have also been adopted by AI for
planning in stochastic environments. In all these fields the goal is to find a (conditional)
plan, or policy, that is optimal with respect to the desired behavior. Traditionally, the main 
focus has been on systems with only one agent or controller, but in the last decade interest 
in systems with multiple agents or decentralized control has grown. 

A different, but also related field is that of game theory. Game theory considers agents, 
called players, interacting in a dynamic, potentially stochastic process, the game. The goal 
here is to find optimal strategies for the agents, that specify how they should play and 
therefore correspond to policies. In contrast to decision-theoretic planning, game theory 
has always considered multiple agents, and as a consequence several ideas and concepts 
from game theory are now being applied in decentralized decision-theoretic planning. In 
this work we apply game-theoretic models to decision-theoretic planning for multiple agents. 

1.1 Decision-Theoretic Planning 

In the last decades, the Markov decision process (MDP) framework has gained in popularity 
in the AI community as a model for planning under uncertainty (Boutilier, Dean, & Hanks, 
1999; Guestrin, Koller, Parr, & Venkataraman, 2003). MDPs can be used to formalize a 
discrete time planning task of a single agent in a stochastically changing environment, on 
the condition that the agent can observe the state of the environment. Every time step 
the state changes stochastically, but the agent chooses an action that selects a particular 
transition function. Taking an action from a particular state at time step t induces a 
probability distribution over states at time step t + 1. 
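
As a minimal sketch of this idea, a transition model can be stored as a table that maps each (state, action) pair to a distribution over next states. The two-state model, the state and action names, and the helper functions below are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical toy model: T[(state, action)] is the distribution over next states.
T = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.1, "s1": 0.9},
    ("s1", "move"): {"s0": 0.8, "s1": 0.2},
}


def next_state_distribution(state, action):
    """Distribution over states at t + 1 induced by taking `action` in `state` at t."""
    return T[(state, action)]


def sample_next_state(state, action, rng):
    """Draw one successor state from the induced distribution."""
    dist = next_state_distribution(state, action)
    states = list(dist)
    weights = [dist[s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    print(next_state_distribution("s0", "move"))
    print(sample_next_state("s0", "move", rng))
```

Choosing a different action in the same state simply selects a different row of the table, which is what "selecting a particular transition function" amounts to in this representation.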

The agent's objective can be formulated in several ways. The first type of objective 
of an agent is reaching a specific goal state, for example in a maze in which the agent's 
goal is to reach the exit. A different formulation is given by associating a certain cost with 
the execution of a particular action in a particular state, in which case the goal will be 
to minimize the expected total cost. Alternatively, one can associate rewards with actions 
performed in a certain state, the goal being to maximize the total reward. 

When the agent knows the probabilities of the state transitions, i.e., when it knows the 
model, it can contemplate the expected transitions over time and construct a plan that is 
most likely to reach a specific goal state, minimizes the expected costs or maximizes the 
expected reward. This stands in some contrast to reinforcement learning (RL) (Sutton & 
Barto, 1998), where the agent does not have a model of the environment, but has to learn 
good behavior by repeatedly interacting with the environment. Reinforcement learning 
can be seen as the combined task of learning the model of the environment and planning, 
although in practice it is often not necessary to explicitly recover the environment model. In
this article we focus only on planning, but consider two factors that complicate computing 
successful plans: the inability of the agent to observe the state of the environment as well 
as the presence of multiple agents. 

In the real world an agent might not be able to determine exactly what the state of the
environment is, because the agent's sensors are noisy and/or limited. When sensors are
noisy, an agent can receive faulty or inaccurate observations with some probability. When
sensors are limited the agent is unable to observe the differences between states that cannot 
be detected by the sensor, e.g., the presence or absence of an object outside a laser range- 
finder's field of view. When the same sensor reading might require different action choices, 
this phenomenon is referred to as perceptual aliasing. In order to deal with the introduced 
sensor uncertainty, a partially observable Markov decision process (POMDP) extends the 
MDP model by incorporating observations and their probability of occurrence conditional 
on the state of the environment (Kaelbling, Littman, & Cassandra, 1998).

The other complicating factor we consider is the presence of multiple agents. Instead of 
planning for a single agent we now plan for a team of cooperative agents. We assume that 
communication within the team is not possible.¹ A major problem in this setting is how the
agents will have to coordinate their actions. Especially, as the agents are not assumed to 
observe the state — each agent only knows its own observations received and actions taken — 
there is no common signal they can condition their actions on. Note that this problem is 
in addition to the problem of partial observability, and not a substitution of it; even if 
the agents could freely and instantaneously communicate their individual observations, the 
joint observations would not disambiguate the true state. 

One option is to consider each agent separately, and have each such agent maintain 
an explicit model of the other agents. This is the approach chosen in the Interactive
POMDP (I-POMDP) framework (Gmytrasiewicz & Doshi, 2005). A problem with this approach,
however, is that the other agents also model the considered agent, leading to an
infinite recursion of beliefs regarding the behavior of agents. We will adopt the decentralized
partially observable Markov decision process (Dec-POMDP) model for this class of problems 
(Bernstein, Givan, Immerman, & Zilberstein, 2002). A Dec-POMDP is a generalization to 
multiple agents of a POMDP and can be used to model a team of cooperative agents that 
are situated in a stochastic, partially observable environment. 

The single-agent MDP setting has received much attention, and many results are known. 
In particular it is known that an optimal plan, or policy, can be extracted from the optimal 
action-value, or Q-value, function Q*(s,a), and that the latter can be calculated efficiently.
For POMDPs, similar results are available, although finding an optimal solution is harder 
(PSPACE-complete for finite-horizon problems, Papadimitriou & Tsitsiklis, 1987). 

On the other hand, for Dec-POMDPs relatively little is known except that they are 
provably intractable (NEXP-complete, Bernstein et al., 2002). In particular, an outstanding 
issue is whether Q-value functions can be defined for Dec-POMDPs just as in (PO)MDPs, 
and whether policies can be extracted from such Q-value functions. Currently most al- 
gorithms for planning in Dec-POMDPs are based on some version of policy search (Nair, 
Tambe, Yokoo, Pynadath, & Marsella, 2003b; Hansen, Bernstein, & Zilberstein, 2004; Szer,
Charpillet, & Zilberstein, 2005; Varakantham, Marecki, Yabu, Tambe, & Yokoo, 2007), and 
a proper theory for Q-value functions in Dec-POMDPs is still lacking. Given the wide range 
of applications of value functions in single-agent decision-theoretic planning, we expect that 
such a theory for Dec-POMDPs can have great benefits, both in terms of providing insight 
as well as guiding the design of solution algorithms. 

1. As it turns out, the framework we consider can also model communication with a particular cost
that is subject to minimization (Pynadath & Tambe, 2002; Goldman & Zilberstein, 2004). The
non-communicative setting can be interpreted as the special case with infinite cost.

1.2 Contributions

In this paper we develop theory for Q-value functions in Dec-POMDPs, showing that an 
optimal Q-function Q* can be defined for a Dec-POMDP. We define two forms of the optimal 
Q-value function for Dec-POMDPs: one that gives a normative description as the Q-value 
function of an optimal pure joint policy and another one that is sequentially rational and 
thus gives a recipe for computation. We also show that given Q*, an optimal policy can
be computed by forward-sweep policy computation, solving a sequence of Bayesian games 
forward through time (i.e., from the first to the last time step), thereby extending the 
solution technique of Emery-Montemerlo, Gordon, Schneider, and Thrun (2004) to the 
exact setting. 

Computation of Q* is infeasible for all but the smallest problems. Therefore, we analyze
three different approximate Q-value functions, Q_MDP, Q_POMDP and Q_BG, that can be
computed more efficiently and which constitute upper bounds to Q*. We also describe a
generalized form of Q_BG that includes Q_POMDP, Q_BG and Q*. This is used to prove a
hierarchy of upper bounds: Q* ≤ Q_BG ≤ Q_POMDP ≤ Q_MDP.

Next, we show how these approximate Q-value functions can be used to compute optimal 
or sub-optimal policies. We describe a generic policy search algorithm, which we dub 
Generalized MAA* (GMAA*) as it is a generalization of the MAA* algorithm by Szer et al. 
(2005), that can be used for extracting a policy from an approximate Q-value function. By 
varying the implementation of a sub-routine of this algorithm, this algorithm unifies MAA* 
and forward-sweep policy computation and thus the approach of Emery-Montemerlo et al. 
(2004). 

Finally, in an experimental evaluation we examine the differences between Q_MDP,
Q_POMDP, Q_BG and Q* for several problems. We also experimentally verify the potential
benefit of tighter heuristics, by testing different settings of GMAA* on some well known 
test problems and on a new benchmark problem involving firefighting agents. 

This article is based on previous work by Oliehoek and Vlassis (2007), abbreviated OV
here, and contains several new contributions: (1) Contrary to the OV work, the current work
includes a section on the sequentially rational description of Q* and suggests a way to compute
Q* in practice (OV only provided a normative description of Q*). (2) The current work
provides a formal proof of the hierarchy of upper bounds to Q* (which was only qualitatively 
argued in the OV paper). (3) The current article additionally contains a proof that the 
solutions for the Bayesian games with identical payoffs given by equation (4.2) constitute 
Pareto optimal Nash equilibria of the game (which was not proven in the OV paper). (4) 
This article contains a more extensive experimental evaluation of the derived bounds of 
Q*, and introduces a new benchmark problem (firefighting). (5) Finally, the current article 
provides a more complete introduction to Dec-POMDPs and existing solution methods, as 
well as Bayesian games, hence it can serve as a self-contained introduction to Dec-POMDPs. 

1.3 Applications 

Although the field of multiagent systems in a stochastic, partially observable environment 
seems quite specialized and thus narrow, the application area is actually very broad. The 
real world is practically always partially observable due to sensor noise and perceptual 
aliasing. Also, in most of these domains communication is not free, but consumes resources
and thus has a particular cost. Therefore models such as Dec-POMDPs, which do consider
partially observable environments, are relevant for essentially all teams of embodied agents.

Example applications of this type are given by Emery-Montemerlo (2005), who con- 
sidered multi-robot navigation in which a team of agents with noisy sensors has to act to 
find/capture a goal. Becker, Zilberstein, Lesser, and Goldman (2004b) use a multi-robot 
space exploration example. Here, the agents are Mars rovers and have to decide how to
proceed with their mission: whether to collect particular samples at specific sites or not. The
rewards of particular samples can be sub- or super-additive, making this task non-trivial.
An overview of application areas in cooperative robotics is presented by Arai, Pagello, and
Parker (2002), among which is robotic soccer, as applied in RoboCup (Kitano, Asada,
Kuniyoshi, Noda, & Osawa, 1997). Another application that is investigated within this project
is crisis management: RoboCup Rescue (Kitano, Tadokoro, Noda, Matsubara, Takahashi,
Shinjoh, & Shimada, 1999) models a situation where rescue teams have to perform a search
and rescue task in a crisis situation. This task also has been modeled as a partially observable
system (Nair, Tambe, & Marsella, 2002, 2003, 2003a; Oliehoek & Visser, 2006; Paquet,
Tobin, & Chaib-draa, 2005). 

There are also many other types of applications. Nair, Varakantham, Tambe, and
Yokoo (2005) and Lesser, Ortiz Jr., and Tambe (2003) give applications for distributed sensor
networks (typically used for surveillance). An example of load balancing among queues is 
presented by Cogill, Rotkowitz, Roy, and Lall (2004). Here agents represent queues and 
can only observe queue sizes of themselves and immediate neighbors. They have to decide 
whether to accept new jobs or pass them to another queue. Another frequently considered 
application domain is communication networks. Peshkin (2001) treated a packet routing 
application in which agents are routers and have to minimize the average transfer time of 
packets. They are connected to immediate neighbors and have to decide at each time step 
to which neighbor to send each packet. Other approaches to communication networks using 
decentralized, stochastic, partially observable systems are given by Ooi and Wornell (1996), 
Tao, Baxter, and Weaver (2001), and Altman (2002).

1.4 Overview of Article 

The rest of this article is organized as follows. In Section 2 we will first formally introduce 
the Dec-POMDP model and provide background on its components. Some existing solution 
methods are treated in Section 3. Then, in Section 4 we show how a Dec-POMDP can be 
modeled as a series of Bayesian games and how this constitutes a theory of Q-value functions 
for BGs. We also treat two forms of optimal Q-value functions, Q*, here. Approximate
Q-value functions are described in Section 5 and one of their applications is discussed in 
Section 6. Section 7 presents the results of the experimental evaluation. Finally, Section 8 
concludes. 

2. Decentralized POMDPs 

In this section we define the Dec-POMDP model and discuss some of its properties. Intu- 
itively, a Dec-POMDP models a number of agents that inhabit a particular environment, 
which is considered at discrete time steps, also referred to as stages (Boutilier et al., 1999) or 
(decision) epochs (Puterman, 1994). The number of time steps the agents will interact with
their environment is called the horizon of the decision problem, and will be denoted by h. In
this paper the horizon is assumed to be finite. At each stage t = 0, 1, 2, ..., h - 1 every agent
takes an action and the combination of these actions influences the environment, causing a 
state transition. At the next time step, each agent first receives an observation of the envi- 
ronment, after which it has to take an action again. The probabilities of state transitions 
and observations are specified by the Dec-POMDP model, as are rewards received for par- 
ticular actions in particular states. The transition- and observation probabilities specify the 
dynamics of the environment, while the rewards specify what behavior is desirable. Hence, 
the reward model defines the agents' goal or task: the agents have to come up with a plan 
that maximizes the expected long term reward signal. In this work we assume that planning 
takes place off-line, after which the computed plans are distributed to the agents, who then 
merely execute the plans on-line. That is, computation of the plan is centralized, while 
execution is decentralized. In the centralized planning phase, the entire model as detailed 
below is available. During execution each agent only knows the joint policy as found by the 
planning phase and its individual history of actions and observations. 

2.1 Formal Model 

In this section we more formally treat the basic components of a Dec-POMDP. We start by 
giving a mathematical definition of these components. 

Definition 2.1. A decentralized partially observable Markov decision process (Dec-POMDP)
is defined as a tuple $\langle n, \mathcal{S}, \mathcal{A}, T, R, \mathcal{O}, O, h, b^0 \rangle$ where:

• $n$ is the number of agents.

• $\mathcal{S}$ is a finite set of states.

• $\mathcal{A}$ is the set of joint actions.

• $T$ is the transition function.

• $R$ is the immediate reward function.

• $\mathcal{O}$ is the set of joint observations.

• $O$ is the observation function.

• $h$ is the horizon of the problem.

• $b^0 \in \mathcal{P}(\mathcal{S})$ is the initial state distribution at time $t = 0$.²

The Dec-POMDP model extends single-agent (PO)MDP models by considering joint
actions and observations. In particular, we define $\mathcal{A} = \times_i \mathcal{A}_i$ as the set of joint actions. Here,
$\mathcal{A}_i$ is the set of actions available to agent $i$. Every time step, one joint action $a = (a_1, \dots, a_n)$ is
taken. In a Dec-POMDP, agents only know their own individual action; they do not observe
each other's actions. We will assume that any action $a_i \in \mathcal{A}_i$ can be selected at any time,
so the set $\mathcal{A}_i$ does not depend on the stage or state of the environment. In general, we will
denote the stage using superscripts, so $a^t$ denotes the joint action taken at stage $t$, and $a_i^t$ is the
individual action of agent $i$ taken at stage $t$. Also, we write $a_{\neq i} = (a_1, \dots, a_{i-1}, a_{i+1}, \dots, a_n)$
for a profile of actions for all agents but $i$.

2. $\mathcal{P}(\cdot)$ denotes the set of probability distributions over $(\cdot)$.

Similarly to the set of joint actions, $\mathcal{O} = \times_i \mathcal{O}_i$ is the set of joint observations, where
$\mathcal{O}_i$ is a set of observations available to agent $i$. Every time step the environment emits one
joint observation $o = (o_1, \dots, o_n)$, from which each agent $i$ only observes its own component
$o_i$, as illustrated by Figure 1. Notation with respect to time and indices for observations
is analogous to the notation for actions. In this paper, we will assume that the action- 
and observation sets are finite. Infinite action- and observation sets are very difficult to 
deal with even in the single-agent case, and to the authors' knowledge no research has been 
performed on this topic in the partially observable, multiagent case. 
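
The tuple of Definition 2.1 can be mirrored by a small container type, with the joint action and joint observation sets built as Cartesian products of the individual sets. This is only a sketch: the class name `DecPOMDP`, its field names, and the toy instance are hypothetical and are not part of the model definition.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, List, Tuple

JointAction = Tuple[str, ...]
JointObservation = Tuple[str, ...]


@dataclass
class DecPOMDP:
    """Illustrative container for the components of a Dec-POMDP."""
    n: int                                   # number of agents
    states: List[str]                        # finite set of states
    agent_actions: List[List[str]]           # individual action sets A_i
    agent_observations: List[List[str]]      # individual observation sets O_i
    T: Callable[[str, JointAction], Dict[str, float]]               # P(s' | s, a)
    R: Callable[[str, JointAction], float]                          # R(s, a)
    O: Callable[[JointAction, str], Dict[JointObservation, float]]  # P(o | a, s')
    h: int                                   # finite horizon
    b0: Dict[str, float]                     # initial state distribution

    def joint_actions(self) -> List[JointAction]:
        """Joint action set as the Cartesian product of the individual sets."""
        return list(product(*self.agent_actions))

    def joint_observations(self) -> List[JointObservation]:
        """Joint observation set as the Cartesian product of the individual sets."""
        return list(product(*self.agent_observations))


# Hypothetical two-agent toy instance with uniform dynamics.
toy = DecPOMDP(
    n=2,
    states=["s0", "s1"],
    agent_actions=[["a", "b"], ["a", "b"]],
    agent_observations=[["x", "y"], ["x", "y"]],
    T=lambda s, a: {"s0": 0.5, "s1": 0.5},
    R=lambda s, a: 1.0 if a[0] == a[1] else 0.0,
    O=lambda a, s_next: {jo: 0.25 for jo in product(["x", "y"], repeat=2)},
    h=3,
    b0={"s0": 0.5, "s1": 0.5},
)

if __name__ == "__main__":
    print(len(toy.joint_actions()), len(toy.joint_observations()))
```

Storing the individual sets and deriving the joint sets on demand keeps the representation compact, since the joint sets grow exponentially with the number of agents.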

Actions and observations are the interface between the agents and their environment. 
The Dec-POMDP framework describes this environment by its states and transitions. This 
means that rather than considering a complex, typically domain-dependent model of the 
environment that explains how this environment works, a descriptive stance is taken: A 
Dec-POMDP specifies an environment model simply as the set of states $\mathcal{S} = \{s_1, \dots, s_{|\mathcal{S}|}\}$
the environment can be in, together with the probabilities of state transitions that are 
dependent on executed joint actions. In particular, the transition from some state to a next 
state depends stochastically on the past states and actions. This probabilistic dependence 
models outcome uncertainty: the fact that the outcome of an action cannot be predicted 
with full certainty. 

An important characteristic of Dec-POMDPs is that the states possess the Markov 
property. That is, the probability of a particular next state depends on the current state 
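and joint action, but not on the whole history:

$$P(s^{t+1} \mid s^t, a^t, s^{t-1}, a^{t-1}, \dots, s^0, a^0) = P(s^{t+1} \mid s^t, a^t). \qquad (2.1)$$
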
Also, we will assume that the transition probabilities are stationary, meaning that they are 
independent of the stage t. 

In a way similar to how the transition model T describes the stochastic influence of 
actions on the environment, the observation model O describes how the state of the envi- 
ronment is perceived by the agents. Formally, O is the observation function, a mapping 
from joint actions and successor states to probability distributions over joint observations: 
$O : \mathcal{A} \times \mathcal{S} \to \mathcal{P}(\mathcal{O})$. I.e., it specifies

$$P(o^t \mid a^{t-1}, s^t). \qquad (2.2)$$

This implies that the observation model also satisfies the Markov property (there is no 
dependence on the history). Also the observation model is assumed stationary: there is no 
dependence on the stage t. 
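
The interplay of the two models can be sketched as a single simulation step: the next state is drawn from the transition model given the current state and joint action, and the joint observation is then drawn from the observation model given that joint action and the successor state. The functions `T`, `O`, `sample` and `step`, and the toy probabilities, are hypothetical. Neither model takes the stage as an argument, which reflects the stationarity assumption.

```python
import random


def sample(dist, rng):
    """Draw one outcome from a dict that maps outcomes to probabilities."""
    outcomes = list(dist)
    weights = [dist[x] for x in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def T(state, joint_action):
    """Hypothetical transition model P(s' | s, a)."""
    if joint_action == ("go", "go"):
        return {"s1": 1.0}
    return {"s0": 0.5, "s1": 0.5}


def O(joint_action, next_state):
    """Hypothetical observation model P(o | a, s')."""
    if next_state == "s1":
        return {("y", "y"): 1.0}
    return {("x", "x"): 0.7, ("x", "y"): 0.1, ("y", "x"): 0.1, ("y", "y"): 0.1}


def step(state, joint_action, rng):
    """One stage of the dynamics: sample s' from T, then o from O given (a, s')."""
    next_state = sample(T(state, joint_action), rng)
    joint_obs = sample(O(joint_action, next_state), rng)
    return next_state, joint_obs


if __name__ == "__main__":
    print(step("s0", ("go", "go"), random.Random(0)))
```

During execution, agent i would only receive component i of the sampled joint observation; the state itself stays hidden.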

Literature has identified different categories of observability (Pynadath & Tambe, 2002; 
Goldman & Zilberstein, 2004). When the observation function is such that the individual 
observation for all the agents will always uniquely identify the true state, the problem 
is considered fully- or individually observable. In such a case, a Dec-POMDP effectively 
reduces to a multiagent MDP (MMDP) as described by Boutilier (1996). The other extreme 
is when the problem is non-observable, meaning that none of the agents observes any useful
information. This is modeled by the fact that agents always receive a null-observation,
$\forall_i \; \mathcal{O}_i = \{o_{i,\emptyset}\}$. Under non-observability agents can only employ an open-loop plan. Between
these two extremes there are partially observable problems. One more special case has been
identified, namely the case where not the individual, but the joint observation identifies the
true state. This case is referred to as jointly- or collectively observable. A jointly observable
Dec-POMDP is referred to as a Dec-MDP.

Figure 1: An illustration of the dynamics of a Dec-POMDP. At every stage the environment
is in a particular state. This state emits a joint observation, of which each agent
observes its individual observation. Then each agent selects an action forming
the joint action.

The reward function $R(s,a)$ is used to specify the goal of the agents and is a function
of states and joint actions. In particular, a desirable sequence of joint actions should
correspond to a high 'long-term' reward, formalized as the return.

Definition 2.2. Let the return or cumulative reward of a Dec-POMDP be defined as the total
of the rewards received during an execution:

$$r(0) + r(1) + \cdots + r(h-1), \qquad (2.3)$$

where $r(t)$ is the reward received at time step $t$.

When, at stage $t$, the state is $s^t$ and the taken joint action is $a^t$, we have that
$r(t) = R(s^t, a^t)$. Therefore, given the sequence of states and taken joint actions, it is
straightforward to determine the return by substitution of $r(t)$ by $R(s^t, a^t)$ in (2.3).
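
The substitution described above can be sketched in a few lines: given one state and one joint action per stage, the return is the sum of the immediate rewards. The reward function `R` and the trajectory below are hypothetical and serve only as an illustration.

```python
def cumulative_reward(R, states, joint_actions):
    """Return r(0) + ... + r(h-1), where r(t) = R(s^t, a^t)."""
    if len(states) != len(joint_actions):
        raise ValueError("need exactly one joint action per stage")
    return sum(R(s, a) for s, a in zip(states, joint_actions))


def R(state, joint_action):
    """Hypothetical reward: 1 when both agents choose the same action, else 0."""
    return 1.0 if joint_action[0] == joint_action[1] else 0.0


# A hypothetical trajectory for horizon h = 3: states s^0..s^2 and joint actions a^0..a^2.
states = ["s0", "s1", "s0"]
joint_actions = [("a", "a"), ("a", "b"), ("b", "b")]

if __name__ == "__main__":
    print(cumulative_reward(R, states, joint_actions))
```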

In this paper we consider as optimality criterion the expected cumulative reward, where
the expectation refers to the expectation over sequences of states and executed joint actions. 
The planning problem is to find a conditional plan, or policy, for each agent to maximize the 
optimality criterion. In the Dec-POMDP case this amounts to finding a tuple of policies, 
called a joint policy that maximizes the expected cumulative reward. 

Note that, in a Dec-POMDP, the agents are assumed not to observe the immediate 
rewards: observing the immediate rewards could convey information regarding the true state 
which is not present in the received observations, which is undesirable as all information 
available to the agents should be modeled in the observations. When planning for Dec- 
POMDPs the only thing that matters is the expectation of the cumulative future reward 
which is available in the off-line planning phase, not the actual reward obtained. Indeed, it 
is not even assumed that the actual reward can be observed at the end of the episode. 

Summarizing, in this work we consider Dec-POMDPs with finite action and observation
sets and a finite planning horizon. Furthermore, we consider the general Dec-POMDP set- 
ting, without any simplifying assumptions on the observation, transition, or reward models. 

2.2 Example: Decentralized Tiger Problem 

Here we will describe the decentralized tiger problem introduced by Nair et al. (2003b). 
This test problem has been frequently used (Nair et al., 2003b; Emery-Montemerlo et al.,
2004; Emery-Montemerlo, Gordon, Schneider, & Thrun, 2005; Szer et al., 2005) and is a
modification of the (single-agent) tiger problem (Kaelbling et al., 1998). It concerns two
agents that are standing in a hallway with two doors. Behind one of the doors is a tiger,
behind the other a treasure. Therefore there are two states: the tiger is behind the left door
($s_l$) or behind the right door ($s_r$). Both agents have 3 actions at their disposal: open the
left door ($a_{OL}$), open the right door ($a_{OR}$) and listen ($a_{Li}$), but they cannot observe each
other's actions. In fact, they can only receive 2 observations: either they hear a sound left
($o_{HL}$) or right ($o_{HR}$).

At $t = 0$ the state is $s_l$ or $s_r$ with probability 0.5. As long as no agent opens a door the
state doesn't change; when a door is opened, the state resets to $s_l$ or $s_r$ with probability 0.5.
The full transition, observation and reward model are listed by Nair et al. (2003b). The
observation probabilities are independent, and identical for both agents. For instance, when
the state is $s_l$ and both perform action $a_{Li}$, both agents have an 85% chance of observing
$o_{HL}$, and the probability of both hearing the tiger left is $0.85 \cdot 0.85 \approx 0.72$.
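
The independence of the individual observations can be made concrete with a small sketch that covers only the case in which both agents listen. The 0.85 figure is taken from the text above; the string labels and function names are hypothetical, and the observation probabilities for the other joint actions (listed by Nair et al., 2003b) are deliberately not reproduced here.

```python
from itertools import product

P_CORRECT = 0.85  # chance of hearing the tiger on the correct side when both agents listen
OBSERVATIONS = ("hear_left", "hear_right")


def individual_obs_prob(obs, state):
    """P(o_i | both listen, state) for a single agent; identical for both agents."""
    correct = "hear_left" if state == "tiger_left" else "hear_right"
    return P_CORRECT if obs == correct else 1.0 - P_CORRECT


def joint_obs_prob(joint_obs, state):
    """P(o | both listen, state): a product, because individual observations are independent."""
    prob = 1.0
    for obs in joint_obs:
        prob *= individual_obs_prob(obs, state)
    return prob


if __name__ == "__main__":
    for joint_obs in product(OBSERVATIONS, repeat=2):
        print(joint_obs, round(joint_obs_prob(joint_obs, "tiger_left"), 4))
```

The four joint observation probabilities sum to one, and the entry for both agents hearing the tiger on the left is 0.7225, which the text rounds to 0.72.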

When the agents open the door for the treasure they receive a positive reward, while 
they receive a penalty for opening the wrong door. When opening the wrong door jointly, 
the penalty is less severe. Opening the correct door jointly leads to a higher reward. 

Note that, when the wrong door is opened by one or both agents, they are attacked 
by the tiger and receive a penalty. However, neither of the agents observe this attack nor 
the penalty and the episode continues. Arguably, a more natural representation would be 
to have the episode end after a door is opened or to let the agents observe whether they 
encountered the tiger or treasure; however, this is not considered in this test problem.

2.3 Histories

As mentioned, the goal of planning in a Dec-POMDP is to find a (near-) optimal tuple 
of policies, and these policies specify for each agent how to act in a specific situation. 
Therefore, before we define a policy, we first need to define exactly what these specific 
situations are. In essence such situations are those parts of the history of the process that 
the agents can observe. 

Let us first consider what the history of the process is. A Dec-POMDP with horizon $h$
specifies $h$ time steps or stages $t = 0, \dots, h-1$. At each of these stages, there is a state $s^t$,
joint observation $o^t$ and joint action $a^t$. Therefore, when the agents will have to select
their $k$-th actions (at $t = k-1$), the history of the process is a sequence of states, joint
observations and joint actions, which has the following form:

$$\left( s^0, o^0, a^0, s^1, o^1, a^1, \dots, s^{k-1}, o^{k-1} \right).$$

Here $s^0$ is the initial state, drawn according to the initial state distribution $b^0$. The initial
joint observation is assumed to be the empty joint observation: $o^0 = o_\emptyset = (o_{1,\emptyset}, \dots, o_{n,\emptyset})$.

From this history of the process, the states remain unobserved and agent i can only 
observe its own actions and observations. Therefore an agent will have to base its decision 
regarding which action to select on the sequence of actions and observations observed up 
to that point. 
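
The projection from the full process history onto what a single agent can observe can be sketched as follows. The representation of a history as a list of (state, joint observation, joint action) records, the use of `None` for the empty initial observation and for the not-yet-chosen last action, and the function name `individual_view` are all assumptions made for this illustration.

```python
def individual_view(process_history, agent):
    """Keep only agent `agent`'s own observations and actions; states are dropped."""
    view = []
    for _state, joint_obs, joint_action in process_history:
        view.append(joint_obs[agent])
        if joint_action is not None:  # the action of the current stage is not chosen yet
            view.append(joint_action[agent])
    return tuple(view)


# Hypothetical two-agent history at stage t = 1, in the spirit of the tiger problem.
history = [
    ("s_l", (None, None), ("listen", "listen")),   # s^0, empty o^0, a^0
    ("s_l", ("hear_left", "hear_right"), None),    # s^1, o^1, a^1 still to be selected
]

if __name__ == "__main__":
    print(individual_view(history, 0))
    print(individual_view(history, 1))
```

The two agents end up with different views of the same underlying process, which is the source of the coordination problem.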

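Definition 2.3. We define the action-observation history for agent $i$, $\vec{\theta}_i$, as the sequence
of actions taken by and observations received by agent $i$. At a specific time step $t$, this is:

$$\vec{\theta}_i^t = \left( o_i^0, a_i^0, o_i^1, \dots, a_i^{t-1}, o_i^t \right).$$

The joint action-observation history, $\vec{\theta}$, is the action-observation history for all agents:

$$\vec{\theta}^t = ( \vec{\theta}_1^t, \dots, \vec{\theta}_n^t ).$$

Agent $i$'s set of possible action-observation histories at time $t$ is $\vec{\Theta}_i^t = \times_t (\mathcal{O}_i \times \mathcal{A}_i)$. The
set of all possible action-observation histories for agent $i$ is $\vec{\Theta}_i = \cup_{t=0}^{h-1} \vec{\Theta}_i^t$.³ Finally the set
of all possible joint action-observation histories is given by $\vec{\Theta} = \cup_{t=0}^{h-1} (\vec{\Theta}_1^t \times \dots \times \vec{\Theta}_n^t)$. At
$t = 0$, the action-observation history is empty, denoted by $\vec{\theta}^0 = \vec{\theta}_\emptyset$.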
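
As a rough illustration of these sets, the stage-t action-observation histories of one agent can be enumerated with a Cartesian product. In this sketch a history is represented as a tuple of (action, observation) pairs with the empty initial observation left implicit; this encoding, the function name, and the action and observation labels are choices made here, not part of the definition.

```python
from itertools import product


def action_observation_histories(actions, observations, t):
    """All stage-t histories of one agent, as tuples of (action, observation) pairs."""
    steps = list(product(actions, observations))
    return list(product(steps, repeat=t))


# Individual sets of one agent in the tiger example (labels are hypothetical).
ACTIONS = ["open_left", "open_right", "listen"]
OBSERVATIONS = ["hear_left", "hear_right"]

if __name__ == "__main__":
    for t in range(3):
        print(t, len(action_observation_histories(ACTIONS, OBSERVATIONS, t)))
```

With three actions and two observations, the stage-t set has 6^t elements, and the only history at t = 0 is the empty one.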

We will also use a notion of history only using the observations of an agent. 

Definition 2.4. Formally, we define the observation history for agent $i$, $\vec{o}_i$, as the sequence
of observations an agent has received. At a specific time step $t$, this is:

$$\vec{o}_i^t = \left( o_i^0, o_i^1, \dots, o_i^t \right).$$

The joint observation history, $\vec{o}$, is the observation history for all agents:

$$\vec{o}^t = ( \vec{o}_1^t, \dots, \vec{o}_n^t ).$$



3. Note that in a particular Dec-POMDP, it may be the case that not all of these histories can actually be
realized, because of the probabilities specified by the transition and observation model.

The set of observation histories for agent $i$ at time $t$ is denoted $\vec{\mathcal{O}}_i^t = \times_t \mathcal{O}_i$. Similar to the
notation for action-observation histories, we also use $\vec{\mathcal{O}}_i$ and $\vec{\mathcal{O}}$, and the empty observation
history is denoted $\vec{o}_\emptyset$.

Similarly we can define the action history as follows. 

Definition 2.5. The action history for agent $i$, $\vec{a}_i$, is the sequence of actions an agent has
performed. At a specific time step $t$, we write:

$$\vec{a}_i^t = \left( a_i^0, a_i^1, \dots, a_i^{t-1} \right).$$

Notation for joint action histories and sets is analogous to that for observation histories.
We also write $\vec{o}_{\neq i}$, $\vec{\theta}_{\neq i}$, etc. to denote a tuple of observation histories, action-observation
histories, etc. for all agents except $i$. Finally we note that, clearly, a (joint) action-observation
history consists of a (joint) action history and a (joint) observation history: $\vec{\theta}^t = (\vec{o}^t, \vec{a}^t)$.

2.4 Policies 

As discussed in the previous section, the action-observation history of an agent specifies 
all the information the agent has when it has to decide upon an action. For the moment 
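we assume that an individual policy $\pi_i$ for agent i is a deterministic mapping from
action-observation sequences to actions.

The number of possible action-observation histories is usually very large as this set
grows exponentially with the horizon of the problem. At time step t, there are $(|\mathcal{A}_i| \cdot |\mathcal{O}_i|)^t$
action-observation histories for agent i. As a consequence there are a total of

$\sum_{t=0}^{h-1} (|\mathcal{A}_i| \cdot |\mathcal{O}_i|)^t = \frac{(|\mathcal{A}_i| \cdot |\mathcal{O}_i|)^h - 1}{(|\mathcal{A}_i| \cdot |\mathcal{O}_i|) - 1}$

such sequences for agent i. Therefore the number of policies for agent i becomes:

$|\mathcal{A}_i|^{\frac{(|\mathcal{A}_i| \cdot |\mathcal{O}_i|)^h - 1}{(|\mathcal{A}_i| \cdot |\mathcal{O}_i|) - 1}}$, (2.4)

which is doubly exponential in the horizon h.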
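To get a feel for this growth, the two counts can be evaluated directly. The following is a minimal sketch; the function names and the problem sizes (3 actions, 2 observations) are illustrative assumptions, not taken from the text.

```python
def num_histories(n_actions: int, n_obs: int, horizon: int) -> int:
    """Count one agent's action-observation histories over stages t = 0..h-1."""
    return sum((n_actions * n_obs) ** t for t in range(horizon))


def num_policies(n_actions: int, n_obs: int, horizon: int) -> int:
    """Count deterministic mappings from those histories to actions, as in (2.4)."""
    return n_actions ** num_histories(n_actions, n_obs, horizon)


# Illustrative sizes (an assumption): 3 actions and 2 observations per agent.
for h in range(1, 5):
    histories = num_histories(3, 2, h)
    digits = len(str(num_policies(3, 2, h)))
    print(f"h={h}: {histories} histories, policy count has {digits} digits")
```

Python integers have arbitrary precision, so the policy count itself can be computed exactly for small horizons; only its number of digits is printed here.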

2.4.1 Pure and Stochastic Policies 

It is possible to reduce the number of policies under consideration by realizing that a lot 
of policies specify the same behavior. This is illustrated by the left side of Figure 2, which 
clearly shows that under a deterministic policy only a subset of possible action-observation 
histories are reached. Policies that only differ with respect to an action-observation history 
that is not reached in the first place, manifest the same behavior. The consequence is that 
in order to specify a deterministic policy, the observation history suffices: when an agent
takes its action deterministically, it will be able to infer what action it took from only the
observation history, as illustrated by the right side of Figure 2.

Definition 2.6. A pure or deterministic policy, $\pi_i$, for agent i in a Dec-POMDP is a
mapping from observation histories to actions, $\pi_i : \vec{\mathcal{O}}_i \rightarrow \mathcal{A}_i$. The set of pure policies of
agent i is denoted $\Pi_i$.

Figure 2: A deterministic policy can be represented as a tree. Left: a tree of action-observation
histories $\vec{\theta}_i$ for one of the agents from the Dec-Tiger problem. An
arbitrary deterministic policy $\pi_i$ is highlighted. Clearly shown is that $\pi_i$ only
reaches a subset of histories $\vec{\theta}_i$. ($\vec{\theta}_i$ that are not reached are not further expanded.)
Right: The same policy can be shown in a simplified policy tree.



Note that also for pure policies we sometimes write $\pi_i(\vec{\theta}_i)$. In this case we mean the
action that $\pi_i$ specifies for the observation history contained in $\vec{\theta}_i$. For instance, let
$\vec{\theta}_i = \langle \vec{o}_i, \vec{a}_i \rangle$, then $\pi_i(\vec{\theta}_i) = \pi_i(\vec{o}_i)$. We use $\pi = \langle \pi_1, \ldots, \pi_n \rangle$ to denote a joint policy, a profile
specifying a policy for each agent. We say that a pure joint policy is an induced or implicit
mapping from joint observation histories to joint actions, $\pi : \vec{\mathcal{O}} \rightarrow \mathcal{A}$. That is, the mapping
is induced by the individual policies $\pi_i$ that make up the joint policy. Also we use
$\pi_{\neq i} = \langle \pi_1, \ldots, \pi_{i-1}, \pi_{i+1}, \ldots, \pi_n \rangle$ to denote a profile of policies for all agents but i.

Apart from pure policies, it is also possible to have the agents execute randomized 
policies, i.e., policies that do not always specify the same action for the same situation, but 
in which there is an element of chance that decides which action is performed. There are 
two types of randomized policies: mixed policies and stochastic policies. 

Definition 2.7. A mixed policy, $\mu_i$, for an agent i is a set of pure policies, $\mathcal{M} \subseteq \Pi_i$, along
with a probability distribution over this set. Thus a mixed policy $\mu_i \in \mathcal{P}(\mathcal{M})$ is an element
of the set of probability distributions over $\mathcal{M}$.

Definition 2.8. A stochastic or behavioral policy, $\varsigma_i$, for agent i is a mapping from
action-observation histories to probability distributions over actions, $\varsigma_i : \vec{\Theta}_i \rightarrow \mathcal{P}(\mathcal{A}_i)$.

When considering stochastic policies, keeping track of only the observations is insufficient,
as in general all action-observation histories can be realized. That is why stochastic
policies are defined as a mapping from the full space of action-observation histories to
probability distributions over actions. Note that we use $\pi_i$ and $\Pi_i$ to denote a policy (space) in
general, so also for randomized policies. We will only use $\pi_i$, $\mu_i$ and $\varsigma_i$ when there is a need
to discriminate between different types of policies.

A common way to represent the temporal structure in a policy is to split it into decision
rules $\delta_i$ that specify the policy for each stage. An individual policy is then represented as
a sequence of decision rules $\pi_i = (\delta_i^0, \ldots, \delta_i^{h-1})$. In case of a deterministic policy, the form
of the decision rule for stage t is a mapping from length-t observation histories to actions,
$\delta_i^t : \vec{\mathcal{O}}_i^t \rightarrow \mathcal{A}_i$.

2.4.2 Special Cases with Simpler Policies. 

There are some special cases of Dec-POMDPs in which the policy can be specified in a 
simpler way. Here we will treat three such cases: in case the state s is observable, in 
the single-agent case and the case that combines the previous two: a single agent in an 
environment of which it can observe the state. 

The last case, a single agent in a fully observable environment, corresponds to the regular 
MDP setting. Because the agent can observe the state, which is Markovian, the agent does 
not need to remember any history, but can simply specify the decision rules $\delta$ of its policy
$\pi = (\delta^0, \ldots, \delta^{h-1})$ as mappings from states to actions: $\forall_t\ \delta^t : \mathcal{S} \rightarrow \mathcal{A}$. The complexity of
the policy representation reduces even further in the infinite-horizon case, where an optimal
policy $\pi^*$ is known to be stationary. As such, there is only one decision rule $\delta$, that is used
for all stages. 

The same is true for multiple agents that can observe the state, i.e., a fully observable 
Dec-POMDP as defined in Section 2.1. This is essentially the same setting as the multiagent 
Markov decision process (MMDP) introduced by Boutilier (1996). In this case, the decision 
rules for agent i's policy are mappings from states to actions $\forall_t\ \delta_i^t : \mathcal{S} \rightarrow \mathcal{A}_i$, although
in this case some care needs to be taken to make sure no coordination errors occur when 
searching for these individual policies. 

In a POMDP, a Dec-POMDP with a single agent, the agent cannot observe the state, 
so it is not possible to specify a policy as a mapping from states to actions. However, it 
turns out that maintaining a probability distribution over states, called belief, $b \in \mathcal{P}(\mathcal{S})$, is
a Markovian signal:

$P(s^{t+1} \mid a^t, o^t, a^{t-1}, o^{t-1}, \ldots, a^0, o^0) = P(s^{t+1} \mid b^t, a^t),$

where the belief $b^t$ is defined as

$b^t(s^t) = P(s^t \mid o^t, a^{t-1}, o^{t-1}, \ldots, a^0, o^0) = P(s^t \mid b^{t-1}, a^{t-1}, o^t).$

As a result, a single agent in a partially observable environment can specify its policy as a
series of mappings from the set of beliefs to actions, $\delta^t : \mathcal{P}(\mathcal{S}) \rightarrow \mathcal{A}$.
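The recursive form of the belief corresponds to a Bayes-rule update. The text does not spell this update out; the sketch below uses the standard form, in which the new belief in $s'$ is proportional to $P(o \mid a, s') \sum_s P(s' \mid s, a)\, b(s)$. The dictionary layout and the two-state "listen" model are hypothetical and serve only as illustration.

```python
def belief_update(belief, action, observation, trans, obs):
    """One Bayes-rule step: b'(s') is proportional to
    P(o | a, s') * sum_s P(s' | s, a) * b(s).

    belief: dict state -> probability (must list every state)
    trans[(s, a)]: dict next_state -> probability
    obs[(a, s_next)]: dict observation -> probability
    """
    unnormalized = {}
    for s_next in belief:
        predicted = sum(p * trans[(s, action)].get(s_next, 0.0)
                        for s, p in belief.items())
        unnormalized[s_next] = obs[(action, s_next)].get(observation, 0.0) * predicted
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {s: p / total for s, p in unnormalized.items()}


# Hypothetical two-state model: 'listen' leaves the state unchanged and
# reports the true state with probability 0.85.
states = ["left", "right"]
trans = {(s, "listen"): {s: 1.0} for s in states}
obs = {("listen", s): {s: 0.85, other: 0.15}
       for s, other in (("left", "right"), ("right", "left"))}

b0 = {"left": 0.5, "right": 0.5}
b1 = belief_update(b0, "listen", "left", trans, obs)
b2 = belief_update(b1, "listen", "left", trans, obs)
```

Because each update needs only the previous belief, the last action and the last observation, the agent never has to store its full history.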

Unfortunately, in the general case we consider, no such space-saving simplifications of 
the policy are possible. Even though the transition and observation model can be used 
to compute a joint belief, this computation requires knowledge of the joint actions and 
observations. During execution, the agents simply have no access to this information and 
thus can not compute a joint belief. 

2.4.3 The Quality of Joint Policies 

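Clearly, policies differ in how much reward they can expect to accumulate, which will serve
as a criterion of a joint policy's quality. Formally, we consider the expected cumulative
reward of a joint policy, also referred to as its value.

Definition 2.9. The value $V(\pi)$ of a joint policy $\pi$ is defined as

$V(\pi) \equiv E\left[ \sum_{t=0}^{h-1} R(s^t, a^t) \,\middle|\, \pi, b^0 \right]$, (2.5)

where the expectation is over states, observations and, in the case of a randomized $\pi$,
actions.

In particular we can calculate this expectation as

$V(\pi) = \sum_{t=0}^{h-1} \sum_{\vec{\theta}^t \in \vec{\Theta}^t} \sum_{s^t \in \mathcal{S}} \sum_{a^t \in \mathcal{A}} P(s^t, \vec{\theta}^t \mid \pi, b^0)\, R(s^t, a^t)\, P_\pi(a^t \mid \vec{\theta}^t)$, (2.6)

where $P_\pi(a^t \mid \vec{\theta}^t)$ is the probability of $a^t$ as specified by $\pi$, and where $P(s^t, \vec{\theta}^t \mid \pi, b^0)$ is
recursively defined as

$P(s^t, \vec{\theta}^t \mid \pi, b^0) = \sum_{s^{t-1} \in \mathcal{S}} P(s^t, \vec{\theta}^t \mid s^{t-1}, \vec{\theta}^{t-1}, \pi)\, P(s^{t-1}, \vec{\theta}^{t-1} \mid \pi, b^0)$, (2.7)

with

$P(s^t, \vec{\theta}^t \mid s^{t-1}, \vec{\theta}^{t-1}, \pi) = P(o^t \mid a^{t-1}, s^t)\, P(s^t \mid s^{t-1}, a^{t-1})\, P_\pi(a^{t-1} \mid \vec{\theta}^{t-1})$ (2.8)

a term that is completely specified by the transition and observation model and the joint
policy. For stage 0 we have that $P(s^0, \vec{\theta}_\emptyset \mid \pi, b^0) = b^0(s^0)$.

Because of the recursive nature of $P(s^t, \vec{\theta}^t \mid \pi, b^0)$ it is more intuitive to specify the value
recursively:

$V_\pi^t(s^t, \vec{\theta}^t) = \sum_{a^t \in \mathcal{A}} P_\pi(a^t \mid \vec{\theta}^t) \left[ R(s^t, a^t) + \sum_{s^{t+1} \in \mathcal{S}} \sum_{o^{t+1} \in \mathcal{O}} P(s^{t+1}, o^{t+1} \mid s^t, a^t)\, V_\pi^{t+1}(s^{t+1}, \vec{\theta}^{t+1}) \right]$, (2.9)

with $\vec{\theta}^{t+1} = (\vec{\theta}^t, a^t, o^{t+1})$. The value of joint policy $\pi$ is then given by

$V(\pi) = \sum_{s^0 \in \mathcal{S}} V_\pi^0(s^0, \vec{\theta}_\emptyset)\, b^0(s^0)$. (2.10)

For the special case of evaluating a pure joint policy $\pi$, eq. (2.6) can be written as:

$V(\pi) = \sum_{t=0}^{h-1} \sum_{\vec{\theta}^t \in \vec{\Theta}^t} P(\vec{\theta}^t \mid \pi, b^0)\, R(\vec{\theta}^t, \pi(\vec{\theta}^t))$, (2.11)

where

$R(\vec{\theta}^t, a^t) = \sum_{s^t \in \mathcal{S}} R(s^t, a^t)\, P(s^t \mid \vec{\theta}^t, b^0)$ (2.12)

denotes the expected immediate reward. In this case, the recursive formulation (2.9) reduces
to

$V_\pi^t(s^t, \vec{o}^t) = R(s^t, \pi(\vec{o}^t)) + \sum_{s^{t+1} \in \mathcal{S}} \sum_{o^{t+1} \in \mathcal{O}} P(s^{t+1}, o^{t+1} \mid s^t, \pi(\vec{o}^t))\, V_\pi^{t+1}(s^{t+1}, \vec{o}^{t+1})$. (2.13)

Note that, when performing the computation of the value for a joint policy recursively,
intermediate results should be cached. A particular $(s^{t+1}, \vec{o}^{t+1})$-pair (or $(s^{t+1}, \vec{\theta}^{t+1})$-pair for
a stochastic joint policy) can be reached from $|\mathcal{S}|$ states $s^t$ of the previous stage. The value
$V_\pi^{t+1}(s^{t+1}, \vec{o}^{t+1})$ is the same, however, and should be computed only once.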
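The recursive evaluation of a pure joint policy with caching can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the dictionary-based model layout, the function name `evaluate_pure_joint_policy` and the two-state toy problem are all hypothetical. Memoization via `functools.lru_cache` ensures that each (stage, state, joint observation history) triple is evaluated only once.

```python
from functools import lru_cache
from itertools import product


def evaluate_pure_joint_policy(policies, model, horizon):
    """Value of a pure joint policy via the recursion for pure policies.

    policies[i]: dict mapping agent i's observation history (a tuple) to an action
    model["T"][(s, a)]: dict next_state -> probability
    model["Z"][(a, s_next)]: dict joint_observation -> probability
    model["R"][(s, a)]: immediate reward; model["b0"]: initial belief
    """
    T, Z, R, b0 = model["T"], model["Z"], model["R"], model["b0"]

    @lru_cache(maxsize=None)  # each (t, s, hist) triple is computed only once
    def value(t, s, hist):
        a = tuple(pi[h] for pi, h in zip(policies, hist))
        v = R[(s, a)]
        if t == horizon - 1:
            return v
        for s_next, p_s in T[(s, a)].items():
            for o, p_o in Z[(a, s_next)].items():
                nxt = tuple(h + (o_i,) for h, o_i in zip(hist, o))
                v += p_s * p_o * value(t + 1, s_next, nxt)
        return v

    empty = tuple(() for _ in policies)
    return sum(p * value(0, s, empty) for s, p in b0.items())


# Hypothetical toy problem: two agents, states x/y, actions 0/1. The team gets
# reward 1 when both actions match the state (x -> 0, y -> 1). The next state
# is uniformly random and both agents observe it perfectly.
S = ["x", "y"]
A = list(product([0, 1], repeat=2))
model = {
    "b0": {"x": 0.5, "y": 0.5},
    "T": {(s, a): {"x": 0.5, "y": 0.5} for s in S for a in A},
    "Z": {(a, s2): {(s2, s2): 1.0} for a in A for s2 in S},
    "R": {(s, a): 1.0 if a == (S.index(s),) * 2 else 0.0 for s in S for a in A},
}
pi = {(): 0, ("x",): 0, ("y",): 1}  # guess 0 first, then follow the observation
v = evaluate_pure_joint_policy([pi, pi], model, horizon=2)
```

In this toy problem the first stage yields an expected reward of 0.5 (the blind guess is right half of the time) and the second stage yields 1, so the value is 1.5.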

2.4.4 Existence of an Optimal Pure Joint Policy 

Although randomized policies may be useful, we can restrict our attention to pure policies 
without sacrificing optimality, as shown by the following. 

Proposition 2.1. A Dec-POMDP has at least one optimal pure joint policy. 

Proof. See appendix A.1. □

3. Overview of Dec-POMDP Solution Methods 

In order to provide some background on solving Dec-POMDPs, this section gives an overview 
of some recently proposed methods. We will limit this review to a number of finite-horizon 
methods for general Dec-POMDPs that are related to our own approach. 

We will not review the work performed on infinite-horizon Dec-POMDPs, such as the
work by Peshkin, Kim, Meuleau, and Kaelbling (2000), Bernstein, Hansen, and Zilberstein
(2005), Szer and Charpillet (2005), Amato, Bernstein, and Zilberstein (2006, 2007a). In this
setting policies are usually represented by finite state controllers (FSCs). Since solving an
infinite-horizon Dec-POMDP is undecidable (Bernstein et al., 2002), this line of work focuses on
finding ε-approximate solutions (Bernstein, 2005) or (near-)optimal policies for a given
controller size.

There also is a substantial amount of work on methods exploiting particular independence
assumptions. In particular, transition and observation independent Dec-MDPs
(Becker et al., 2004b; Wu & Durfee, 2006) and Dec-POMDPs (Kim, Nair, Varakantham,
Tambe, & Yokoo, 2006; Varakantham et al., 2007) have received quite some attention.
These models assume that each agent i has an individual state space $\mathcal{S}_i$ and that the actions
of one agent do not influence the transitions between the local states of another agent.
Although such models are easier to solve, the independence assumptions severely restrict
their applicability. Other special cases that have been considered are, for instance,
goal-oriented Dec-POMDPs (Goldman & Zilberstein, 2004), event-driven Dec-MDPs (Becker,
Zilberstein, & Lesser, 2004a), Dec-MDPs with time and resource constraints (Beynier &
Mouaddib, 2005, 2006; Marecki & Tambe, 2007), Dec-MDPs with local interactions (Spaan
& Melo, 2008) and factored Dec-POMDPs with additive rewards (Oliehoek, Spaan, Whiteson,
& Vlassis, 2008).

A final body of related work, which is beyond the scope of this article, consists of models and
techniques for explicit communication in Dec-POMDP settings (Ooi & Wornell, 1996;
Pynadath & Tambe, 2002; Goldman & Zilberstein, 2003; Nair, Roth, & Yokoo, 2004; Becker,
Lesser, & Zilberstein, 2005; Roth, Simmons, & Veloso, 2005; Oliehoek, Spaan, & Vlassis,
2007b; Roth, Simmons, & Veloso, 2007; Goldman, Allen, & Zilberstein, 2007). The
Dec-POMDP model itself can model communication actions as regular actions, in which case the
semantics of the communication actions becomes part of the optimization problem (Xuan,
Lesser, & Zilberstein, 2001; Goldman & Zilberstein, 2003; Spaan, Gordon, & Vlassis, 2006).
In contrast, most approaches mentioned typically assume that communication happens
outside the Dec-POMDP model and with pre-defined semantics. A typical assumption is that
at every time step the agents communicate their individual observations before selecting an
action. Pynadath and Tambe (2002) showed that, under assumptions of instantaneous and
cost-free communication, sharing individual observations in such a way is optimal.



3.1 Brute Force Policy Evaluation 

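Because there exists an optimal pure joint policy for a finite-horizon Dec-POMDP, it is in
theory possible to enumerate all different pure joint policies, evaluate them using equations
(2.10) and (2.13) and choose the best one. The number of pure joint policies to be evaluated
is:

$O\left( |\mathcal{A}_*|^{\frac{n (|\mathcal{O}_*|^h - 1)}{|\mathcal{O}_*| - 1}} \right)$, (3.1)

where $|\mathcal{A}_*|$ and $|\mathcal{O}_*|$ denote the sizes of the largest individual action and observation sets. The cost
of evaluating each policy is $O(|\mathcal{S}| \cdot |\mathcal{O}_*|^{nh})$. The resulting total cost of brute-force policy
evaluation is

$O\left( |\mathcal{A}_*|^{\frac{n (|\mathcal{O}_*|^h - 1)}{|\mathcal{O}_*| - 1}} \times |\mathcal{S}| \times |\mathcal{O}_*|^{nh} \right)$, (3.2)

which is doubly exponential in the horizon h.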
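The enumeration underlying brute-force search can be made concrete for one agent: a pure policy assigns an action to every observation history of length 0 to h−1. The sketch below is illustrative only; the function names and the action and observation labels are assumptions.

```python
from itertools import product


def observation_histories(observations, horizon):
    """All observation histories of length 0..horizon-1, as tuples."""
    return [h for t in range(horizon) for h in product(observations, repeat=t)]


def pure_policies(actions, observations, horizon):
    """Yield every pure policy as a dict: observation history -> action."""
    hists = observation_histories(observations, horizon)
    for choice in product(actions, repeat=len(hists)):
        yield dict(zip(hists, choice))


def joint_policy_count(n_agents, n_actions, n_obs, horizon):
    """Number of pure joint policies when all agents have equally sized sets."""
    histories_per_agent = sum(n_obs ** t for t in range(horizon))
    return n_actions ** (n_agents * histories_per_agent)


# Illustrative labels (assumptions): two actions, two observations, horizon 2.
individual = list(pure_policies(["a1", "a2"], ["o1", "o2"], horizon=2))
print(len(individual), joint_policy_count(2, 2, 2, 2))
```

With two agents, two actions and two observations each, horizon 2 already gives 8 individual and 64 joint policies; at horizon 3 with three actions the joint count is $3^{14}$.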



3.2 Alternating Maximization 

Nair et al. (2003b) introduced Joint Equilibrium based Search for Policies (JESP). This
method is guaranteed to find a locally optimal joint policy, more specifically, a Nash
equilibrium: a tuple of policies such that for each agent i its policy $\pi_i$ is a best response to the
policies employed by the other agents $\pi_{\neq i}$. It relies on a process we refer to as alternating
maximization. This is a procedure that computes a policy $\pi_i$ for an agent i that maximizes
the joint reward, while keeping the policies of the other agents fixed. Next, another agent is
chosen to maximize the joint reward by finding its best response to the fixed policies of the
other agents. This process is repeated until the joint policy converges to a Nash equilibrium,
which is a local optimum. The main idea of fixing some agents and having others improve
their policy was presented before by Chades, Scherrer, and Charpillet (2002), but they used
a heuristic approach for memory-less agents. The process of alternating maximization is
also referred to as hill-climbing or coordinate ascent.

Nair et al. (2003b) describe two variants of JESP, the first of which, Exhaustive-JESP, 
implements the above idea in a very straightforward fashion: Starting from a random joint 
policy, the first agent is chosen. This agent then selects its best-response policy by evaluating 
the joint reward obtained for all of its individual policies when the other agents follow their 
fixed policy. 
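The idea of alternating maximization can be illustrated on a much simpler object than a Dec-POMDP. The sketch below is not JESP itself: it treats a one-shot identical-payoff game in which each agent's "policy" is a single action, and the payoff matrix is hypothetical. It shows the two properties mentioned above: the procedure stops at a Nash equilibrium, and which equilibrium it reaches depends on the starting point.

```python
def best_response(values, current):
    """Return the index with the highest value, keeping `current` unless a
    strictly better option exists (this guarantees termination)."""
    best = max(range(len(values)), key=values.__getitem__)
    return best if values[best] > values[current] else current


def alternating_maximization(payoff, a1, a2):
    """payoff[a1][a2] is the shared reward. The agents take turns
    best-responding until neither can improve; the result is a Nash equilibrium."""
    while True:
        new_a1 = best_response([row[a2] for row in payoff], a1)
        new_a2 = best_response(payoff[new_a1], a2)
        if (new_a1, new_a2) == (a1, a2):
            return a1, a2
        a1, a2 = new_a1, new_a2


# Hypothetical shared payoffs with three equilibria on the diagonal.
payoff = [[5, 0, 0],
          [0, 10, 0],
          [0, 0, 1]]
print(alternating_maximization(payoff, 0, 0))  # stays at the local optimum (0, 0)
print(alternating_maximization(payoff, 0, 1))  # reaches the global optimum (1, 1)
```

Because the shared payoff strictly increases with every switch, the loop must terminate in a finite game; this is also why random restarts are a natural complement to such hill-climbing.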

The second variant, DP-JESP, uses a dynamic programming approach to compute the
best-response policy for a selected agent i. In essence, fixing the policies of all other agents
allows for a reformulation of the problem as an augmented POMDP. In this augmented
POMDP a state $\bar{s} = \langle s, \vec{o}_{\neq i} \rangle$ consists of a nominal state s and the observation histories of
the other agents $\vec{o}_{\neq i}$. Given the fixed deterministic policies of the other agents $\pi_{\neq i}$, such an
augmented state $\bar{s}$ is a Markovian state, and all transition and observation probabilities can
easily be derived from $\pi_{\neq i}$.

Like most methods proposed for Dec-POMDPs, JESP exploits the knowledge of the
initial belief by only considering reachable beliefs $b(\bar{s})$ in the solution of the POMDP.
However, in some cases the initial belief might not be available. As demonstrated by
Varakantham, Nair, Tambe, and Yokoo (2006), JESP can be extended to plan for the entire
space of initial beliefs, overcoming this problem.

3.3 MAA* 

Szer et al. (2005) introduced a heuristically guided policy search method called multiagent 
A* (MAA*). It performs a guided A*-like search over partially specified joint policies, 
pruning joint policies that are guaranteed to be worse than the best (fully specified) joint 
policy found so far by an admissible heuristic. 

In particular MAA* considers joint policies that are partially specified with respect
to time: a partial joint policy $\varphi^t = (\delta^0, \delta^1, \ldots, \delta^{t-1})$ specifies the joint decision rules for
the first t stages. For such a partial joint policy $\varphi^t$ a heuristic value $\hat{V}(\varphi^t)$ is calculated
by taking $V^{0 \ldots t-1}(\varphi^t)$, the actual expected reward $\varphi^t$ achieves over the first t stages, and
adding $\hat{V}^{t \ldots h-1}$, a heuristic value for the remaining $h - t$ stages. Clearly when $\hat{V}^{t \ldots h-1}$ is an
admissible heuristic (a guaranteed overestimation), so is $\hat{V}(\varphi^t)$.

MAA* starts by placing the completely unspecified joint policy $\varphi^0$ in an open list.
Then, it proceeds by selecting partial joint policies $\varphi^t = (\delta^0, \delta^1, \ldots, \delta^{t-1})$ from the list and
'expanding' them: generating all $\varphi^{t+1} = (\delta^0, \delta^1, \ldots, \delta^{t-1}, \delta^t)$ by appending all possible joint
decision rules $\delta^t$ for the next time step (t). The left side of Figure 3 illustrates the expansion
process. After expansion, all created children are heuristically evaluated and placed in the
open list. Any partial joint policy $\varphi^{t+1}$ with $\hat{V}(\varphi^{t+1})$ less than the expected value $V(\pi)$ of
some earlier found (fully specified) joint policy $\pi$ can be pruned. The search ends when the
list becomes empty, at which point we have found an optimal fully specified joint policy.
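The control flow described above (open list, expansion, heuristic evaluation, pruning against the best full policy) can be sketched generically. The code below is not MAA* itself: decision rules are abstract labels, and the reward function and per-stage upper bounds are hypothetical stand-ins for the exact value of the specified stages and the admissible heuristic for the remaining ones.

```python
import heapq
from itertools import count


def best_first_policy_search(horizon, rules, reward, stage_bound):
    """A*-like search over partial plans phi = (d_0, ..., d_{t-1}).

    reward(t, phi, d): reward collected at stage t when d is appended to phi.
    stage_bound[t]: upper bound on the reward obtainable at stage t.
    """
    tie = count()  # tie-breaker so the heap never compares plans directly
    # Heap entries: (negated heuristic value, tie-breaker, achieved value, phi).
    open_list = [(-sum(stage_bound), next(tie), 0.0, ())]
    best_value, best_plan = float("-inf"), None
    while open_list:
        neg_f, _, g, phi = heapq.heappop(open_list)
        if -neg_f < best_value:
            continue  # cannot beat the best full plan found so far
        t = len(phi)
        for d in rules:
            child, g_child = phi + (d,), g + reward(t, phi, d)
            if t + 1 == horizon:  # fully specified: update the incumbent
                if g_child > best_value:
                    best_value, best_plan = g_child, child
            else:
                f_child = g_child + sum(stage_bound[t + 1:])
                if f_child >= best_value:
                    heapq.heappush(open_list, (-f_child, next(tie), g_child, child))
    return best_value, best_plan


def toy_reward(t, phi, d):
    """Hypothetical reward: 2 for alternating rules, else 1 for rule 0, else 0."""
    if phi and phi[-1] != d:
        return 2.0
    return 1.0 if d == 0 else 0.0


value, plan = best_first_policy_search(3, (0, 1), toy_reward, [1.0, 2.0, 2.0])
```

Since the bound never underestimates the remaining reward, a partial plan is only discarded when no completion of it can beat the incumbent, so the returned plan is optimal for this toy problem.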

3.4 Dynamic Programming for Dec-POMDPs 

MAA* incrementally builds policies from the first stage $t = 0$ to the last $t = h - 1$. Prior to
this work, Hansen et al. (2004) introduced dynamic programming (DP) for Dec-POMDPs,
which constructs policies the other way around: starting with a set of '1-step policies'
(actions) that can be executed at the last stage, they construct a set of 2-step policies to
be executed at $h - 2$, etc.

It should be stressed that the policies maintained are quite different from those used
by MAA*. In particular a partial policy in MAA* has the form $\varphi^t = (\delta^0, \delta^1, \ldots, \delta^{t-1})$. The
policies maintained by DP do not have such a correspondence to decision rules. We define
the time-to-go $\tau$ at stage t as

$\tau = h - t$. (3.3)

Now $q_i^{\tau=k}$ denotes a k-steps-to-go sub-tree policy for agent i. That is, $q_i^{\tau=k}$ is a policy
tree that has the same form as a full policy for the horizon-k problem. Within the original
horizon-h problem $q_i^{\tau=k}$ is a candidate for execution starting at stage $t = h - k$. The set of
k-steps-to-go sub-tree policies maintained for agent i is denoted $Q_i^{\tau=k}$. Dynamic programming
for Dec-POMDPs is based on backup operations: constructing $Q_i^{\tau=k+1}$, a set of sub-tree
policies $q_i^{\tau=k+1}$, from a set $Q_i^{\tau=k}$. For instance, the right side of Figure 3 shows how $q_i^{\tau=3}$, a
3-steps-to-go sub-tree policy, is constructed from two $q_i^{\tau=2} \in Q_i^{\tau=2}$. Also illustrated is the
difference between this process and MAA* expansion (on the left side).

Figure 3: Difference between policy construction in MAA* (left) and dynamic programming
(right) for an agent with actions $a, \bar{a}$ and observations $o, \bar{o}$. The dashed components
are newly generated, dotted components result from the previous iteration.
MAA* 'expands' a partial policy from the leaves, while dynamic programming
backs up a set of 'sub-tree policies' forming new ones.

Dynamic programming consecutively constructs $Q_i^{\tau=1}, Q_i^{\tau=2}, \ldots, Q_i^{\tau=h}$ for all agents i.
However, the size of the set $Q_i^{\tau=k+1}$ is given by

$|Q_i^{\tau=k+1}| = |\mathcal{A}_i| \, |Q_i^{\tau=k}|^{|\mathcal{O}_i|}$,

and as a result the sizes of the maintained sets grow doubly exponentially with k. To counter
this source of intractability, Hansen et al. (2004) propose to eliminate dominated sub-tree
policies. The expected reward of a particular sub-tree policy $q_i^{\tau=k}$ depends on the probability
over states when $q_i^{\tau=k}$ is started (at stage $t = h - k$) as well as the probability with which
the other agents $j \neq i$ select their sub-tree policies $q_j^{\tau=k} \in Q_j^{\tau=k}$. If we let $q_{\neq i}^{\tau=k}$ denote a
sub-tree profile for all agents but i, and $Q_{\neq i}^{\tau=k}$ the set of such profiles, we can say that $q_i^{\tau=k}$
is dominated if it is not maximizing at any point in the multiagent belief space: the simplex
over $\mathcal{S} \times Q_{\neq i}^{\tau=k}$. Hansen et al. test for dominance over the entire multiagent belief space by
linear programming. Removal of a dominated sub-tree policy $q_i^{\tau=k}$ of an agent i may cause
a sub-tree policy $q_j^{\tau=k}$ of another agent j to become dominated. Therefore Hansen et al.
propose to iterate over agents until no further pruning is possible, a procedure known as
iterated elimination of dominated policies (Osborne & Rubinstein, 1994).
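The exhaustive backup (before any pruning) and the resulting growth in set size can be sketched as follows. The tree representation, a pair of a root action and a dictionary from observations to subtrees, is an assumption made for illustration, as are the action and observation labels.

```python
from itertools import product


def exhaustive_backup(actions, observations, subtrees):
    """Build every (k+1)-steps-to-go tree: a root action plus one
    k-steps-to-go subtree for each observation."""
    return [(a, dict(zip(observations, children)))
            for a in actions
            for children in product(subtrees, repeat=len(observations))]


def backed_up_sizes(n_actions, n_obs, horizon):
    """Sizes of the sets for tau = 1..h using |Q^{k+1}| = |A| * |Q^k| ** |O|."""
    sizes = [n_actions]
    for _ in range(horizon - 1):
        sizes.append(n_actions * sizes[-1] ** n_obs)
    return sizes


actions, observations = ["a1", "a2"], ["o1", "o2"]  # illustrative labels
q1 = [(a, {}) for a in actions]  # 1-step policies are single actions
q2 = exhaustive_backup(actions, observations, q1)
q3 = exhaustive_backup(actions, observations, q2)
print(len(q1), len(q2), len(q3), backed_up_sizes(2, 2, 4))
```

Even with two actions and two observations the sets grow as 2, 8, 128, 32768, which is why pruning dominated sub-tree policies after each backup is essential.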

Finally, when the last backup step is completed the optimal policy can be found by
evaluating all joint policies $\pi \in Q_1^{\tau=h} \times \cdots \times Q_n^{\tau=h}$ for the initial belief $b^0$.

3.5 Extensions on DP for Dec-POMDPs

In the last few years several extensions to the dynamic programming algorithm for
Dec-POMDPs have been proposed. The first of these extensions is due to Szer and Charpillet
(2006). Rather than testing for dominance over the entire multiagent belief space, Szer
and Charpillet propose to perform point-based dynamic programming (PBDP). In order
to prune the set of sub-tree policies $Q_i^{\tau=k}$, the set of all the belief points
$\mathcal{B}_{i,\text{reachable}} \subset \mathcal{P}(\mathcal{S} \times Q_{\neq i}^{\tau=k})$ that can possibly be reached by deterministic joint policies is generated.
Only the sub-tree policies $q_i^{\tau=k}$ that maximize the value at some $b_i \in \mathcal{B}_{i,\text{reachable}}$ are kept.
The proposed algorithm is optimal, but intractable because it needs to generate all the
multiagent belief points that are reachable through all joint policies. To overcome this
bottleneck, Szer and Charpillet propose to randomly sample one or more joint policies and
use those to generate $\mathcal{B}_{i,\text{reachable}}$.

Seuken and Zilberstein (2007b) also proposed a point-based extension of the DP algorithm,
called memory-bounded dynamic programming (MBDP). Rather than using a
randomly selected policy to generate the belief points, they propose to use heuristic policies.
A more important difference, however, lies in the pruning step. Rather than pruning
dominated sub-tree policies $q_i^{\tau=k}$, MBDP prunes all sub-tree policies except a few in each
iteration. More specifically, for each agent maxTrees sub-tree policies are retained, where
maxTrees is a parameter of the planning method. As a result, MBDP has only linear space and time
complexity with respect to the horizon. The MBDP algorithm still depends on the exhaustive
generation of the sets $Q_i^{\tau=k+1}$, which now contain $|\mathcal{A}_i| \cdot \mathit{maxTrees}^{|\mathcal{O}_i|}$ sub-tree policies.
Moreover, in each iteration all $(|\mathcal{A}_*| \cdot \mathit{maxTrees}^{|\mathcal{O}_*|})^n$ joint sub-tree policies have to be
evaluated for each of the sampled belief points. To counter this growth, Seuken and Zilberstein
(2007a) proposed an extension that limits the considered observations during the backup
step to the maxObs most likely observations.

Finally, a further extension of the DP for Dec-POMDPs algorithm is given by Amato,
Carlin, and Zilberstein (2007b). Their approach, bounded DP (BDP), establishes a bound
not on the used memory, but on the quality of approximation. In particular, BDP uses
ε-pruning in each iteration. That is, a $q_i^{\tau=k}$ that is maximizing in some region of the
multiagent belief space, but improves the value in this region by at most ε, is also pruned.
Because iterated elimination using ε-pruning can still lead to an unbounded reduction in
value, Amato et al. propose to perform one iteration of ε-pruning, followed by iterated
elimination using normal pruning.

3.6 Other Approaches for Finite-Horizon Dec-POMDPs 

There are a few other approaches for finite-horizon Dec-POMDPs, which we will only briefly 
describe here. Aras, Dutech, and Charpillet (2007) proposed a mixed integer linear pro- 
gramming formulation for the optimal solution of finite-horizon Dec-POMDPs. Their ap- 
proach is based on representing the set of possible policies for each agent in sequence form 
(Romanovskii, 1962; Koller, Megiddo, & von Stengel, 1994; Koller & Pfeffer, 1997). In sequence
form, a single policy for an agent i is represented as a subset of the set of 'sequences'
(roughly corresponding to action-observation histories) for the agent. As such the problem 
can be interpreted as a combinatorial optimization problem, which Aras et al. propose to 
solve with a mixed integer linear program.

Oliehoek, Kooij, and Vlassis (2007a) also recognize that finding a solution for Dec-POMDPs
in essence is a combinatorial optimization problem and propose to apply the
Cross-Entropy method (de Boer, Kroese, Mannor, & Rubinstein, 2005), a method for combinatorial
optimization that recently has become popular because of its ability to find
near-optimal solutions in large optimization problems. The resulting algorithm performs a 
sampling-based policy search for approximately solving Dec-POMDPs. It operates by sam- 
pling pure policies from an appropriately parameterized stochastic policy, and then evaluates 
these policies either exactly or approximately in order to define the next stochastic policy 
to sample from, and so on until convergence. 

Finally, Emery-Montemerlo et al. (2004, 2005) proposed to approximate Dec-POMDPs
through a series of Bayesian games. Since our work in this article is based on the same
representation, we defer a detailed explanation to the next section. We do mention here
that while Emery-Montemerlo et al. assume that the algorithm is run on-line (interleaving
planning and execution), no such assumption is necessary. Rather we will apply the same
framework during an off-line planning phase, just like the other algorithms covered in this
overview.

4. Optimal Q-value Functions 

In this section we will show how a Dec-POMDP can be modeled as a series of Bayesian
games (BGs). A BG is a game-theoretic model that can deal with uncertainty (Osborne
& Rubinstein, 1994). Bayesian games are similar to the more well-known normal form, or
matrix, games, but allow modeling agents that have some private information. The idea of
using a series of BGs to find policies for a Dec-POMDP has been proposed in an approximate
setting by Emery-Montemerlo et al. (2004). In particular, they showed that using a series of
BGs and an approximate payoff function, they were able to obtain approximate solutions on
the Dec-Tiger problem, comparable to results for JESP (see Section 3.2).

The main result of this section is that an optimal Dec-POMDP policy can be computed
from the solution of a sequence of Bayesian games, if the payoff function of those games
coincides with the Q-value function of an optimal policy $\pi^*$, i.e., with the optimal
Q-value function $Q^*$. Thus, we extend the results of Emery-Montemerlo et al. (2004) to
include the optimal setting. Also, we conjecture that this form of $Q^*$ cannot be computed
without already knowing an optimal policy $\pi^*$. By transferring the game-theoretic concept
of sequential rationality to Dec-POMDPs, we find a description of $Q^*$ that is computable
without knowing $\pi^*$ up front.

4.1 Game-Theoretic Background 

Before we can explain how Dec-POMDPs can be modeled using Bayesian games, we will
first introduce them together with some other necessary game-theoretic background.

Figure 4: Left: The game 'Chicken'. Both players have the option to (D)rive on or (C)hicken
out. Right: The meeting location problem. Because the game has identical
payoffs, each entry contains just one number.



4.1.1 Strategic Form Games and Nash Equilibria 

At the basis of the concept of a Bayesian game lies a simpler form of game: the strategic- or 
normal form game. A strategic game consists of a set of agents or players, each of which has 
a set of actions (or strategies). The combination of selected actions specifies a particular 
outcome. When a strategic game consists of two agents, it can be visualized as a matrix 
as shown in Figure 4. The first game shown is called 'Chicken' and involves two teenagers 
who are driving head on. Both have the option to drive on or chicken out. Each teenager's 
payoff is maximal (+2) when he drives on and his opponent chickens out. However, if both 
drive on, a collision follows giving both a payoff of $-1$. The second game is the meeting
location problem. Both agents want to meet in location A or B. They have no preference 
over which location, as long as both pick the same location. This game is fully cooperative, 
which is modeled by the fact that the agents receive identical payoffs. 

Definition 4.1. Formally, a strategic game is a tuple $\langle n, \mathcal{A}, u \rangle$, where n is the number of
agents, $\mathcal{A} = \times_i \mathcal{A}_i$ is the set of joint actions, and $u = \langle u_1, \ldots, u_n \rangle$ with $u_i : \mathcal{A} \rightarrow \mathbb{R}$ is the
payoff function of agent i.

Game theory tries to specify for each agent how to play. That is, a game-theoretic
solution should suggest a policy for each agent. In a strategic game we write $\alpha_i$ to denote
a policy for agent i and $\alpha$ for a joint policy. A policy for agent i is simply one of its actions
$\alpha_i = a_i \in \mathcal{A}_i$ (i.e., a pure policy), or a probability distribution over its actions $\alpha_i \in \mathcal{P}(\mathcal{A}_i)$
(i.e., a mixed policy). Also, the policy suggested to each agent should be rational given
the policies suggested to the other agents; it would be undesirable to suggest a particular
policy to an agent, if it can get a better payoff by switching to another policy. Rather, the
suggested policies should form an equilibrium, meaning that it is not profitable for an agent
to unilaterally deviate from its suggested policy. This notion is formalized by the concept
of Nash equilibrium.

Definition 4.2. A pure policy profile $\alpha = \langle \alpha_1, \ldots, \alpha_i, \ldots, \alpha_n \rangle$ specifying a pure policy for
each agent is a Nash Equilibrium (NE) if and only if

$u_i(\langle \alpha_1, \ldots, \alpha_i, \ldots, \alpha_n \rangle) \geq u_i(\langle \alpha_1, \ldots, \alpha_i', \ldots, \alpha_n \rangle), \quad \forall_i\ \forall_{\alpha_i' \in \mathcal{A}_i}.$

This definition can be easily extended to incorporate mixed policies by defining

$u_i(\langle \alpha_1, \ldots, \alpha_n \rangle) = \sum_{\langle a_1, \ldots, a_n \rangle} u_i(\langle a_1, \ldots, a_n \rangle) \prod_{j=1}^{n} P_{\alpha_j}(a_j),$

where $P_{\alpha_j}(a_j)$ is the probability that $\alpha_j$ assigns to action $a_j$.

Nash (1950) proved that when allowing mixed policies, every (finite) strategic game contains
at least one NE, making it a proper solution for a game. However, it is unclear how such a
NE should be found. In particular, there may be multiple NEs in a game, making it unclear
which one to select. In order to make some discrimination between Nash equilibria, we can
consider NEs such that there is no other NE that is better for everyone.

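Definition 4.3. A Nash Equilibrium $\alpha = \langle \alpha_1, \ldots, \alpha_i, \ldots, \alpha_n \rangle$ is referred to as Pareto
Optimal (PO) when there is no other NE $\alpha'$ that specifies at least the same payoff for all agents
and a higher payoff for at least one agent:

$\nexists \alpha' \left( \forall_i\ u_i(\alpha') \geq u_i(\alpha) \ \wedge\ \exists_j\ u_j(\alpha') > u_j(\alpha) \right).$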


In the case when multiple Pareto optimal Nash equilibria exist, the agents can agree 
beforehand on a particular ordering, to ensure the same NE is chosen. 
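For small two-player games, pure Nash equilibria and the Pareto optimal ones among them can be found by direct enumeration. The sketch below is illustrative: the function names are assumptions, and so are most payoff entries. Only the +2 and −1 entries of 'Chicken' come from the text; the remaining entries and the whole coordination game are hypothetical.

```python
from itertools import product


def pure_nash_equilibria(u1, u2):
    """Return all (row, col) profiles from which neither player gains by deviating."""
    rows, cols = range(len(u1)), range(len(u1[0]))
    return [(r, c) for r, c in product(rows, cols)
            if all(u1[r][c] >= u1[r2][c] for r2 in rows)
            and all(u2[r][c] >= u2[r][c2] for c2 in cols)]


def pareto_optimal(equilibria, u1, u2):
    """Keep the equilibria that no other equilibrium weakly improves for both
    players and strictly improves for at least one."""
    def payoffs(e):
        return (u1[e[0]][e[1]], u2[e[0]][e[1]])

    def dominates(x, y):
        px, py = payoffs(x), payoffs(y)
        return all(a >= b for a, b in zip(px, py)) and px != py

    return [e for e in equilibria if not any(dominates(o, e) for o in equilibria)]


# 'Chicken' with index 0 = drive on, 1 = chicken out. The 0 and 1 payoffs
# are assumed for illustration.
chicken_u1 = [[-1, 2], [0, 1]]
chicken_u2 = [[-1, 0], [2, 1]]
chicken_ne = pure_nash_equilibria(chicken_u1, chicken_u2)

# Hypothetical identical-payoff coordination game with one better meeting point.
coord = [[2, 0], [0, 1]]
coord_ne = pure_nash_equilibria(coord, coord)
coord_po = pareto_optimal(coord_ne, coord, coord)
```

Under these assumed payoffs 'Chicken' has two pure equilibria, neither of which dominates the other, whereas the coordination game has two equilibria of which only one is Pareto optimal.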

4.1.2 Bayesian Games 

A Bayesian game (Osborne & Rubinstein, 1994) is an augmented normal form game in 
which the players hold some private information. This private information defines the type 
of the agent, i.e., a particular type $\theta_i \in \Theta_i$ of an agent i corresponds to that agent knowing
some particular information. The payoff the agents receive now no longer only depends on 
their actions, but also on their private information. Formally, a BG is defined as follows: 

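Definition 4.4. A Bayesian game (BG) is a tuple $\langle n, \mathcal{A}, \Theta, P(\Theta), \langle u_1, \ldots, u_n \rangle \rangle$, where n is the
number of agents, $\mathcal{A}$ is the set of joint actions, $\Theta = \times_i \Theta_i$ is the set of joint types over
which a probability function $P(\Theta)$ is specified, and $u_i : \Theta \times \mathcal{A} \rightarrow \mathbb{R}$ is the payoff function
of agent i.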

In a normal form game the agents select an action. Now, in a BG the agents can 
condition their action on their private information. This means that in BGs the agents 
use a different type of policies. For a BG, we denote a joint policy $\beta = \langle \beta_1, \ldots, \beta_n \rangle$, where
the individual policies are mappings from types to actions: $\beta_i : \Theta_i \rightarrow \mathcal{A}_i$. In the case of
identical payoffs for the agents, the solution of a BG is given by the following theorem:

Theorem 4.1. For a BG with identical payoffs, i.e., '^ij'ie^a Ui{6,a) = Uj{9,a), the solution 
is given by: 



where j3{9) = (/3i(0i),...,/3„(0„)) is the joint action specified by f3 for joint type 6. This 
solution constitutes a Pareto optimal Nash equilibrium. 

Proof. The proof consists of two parts: the first shows that f3* is a Nash equilibrium, the 
second shows it is Pareto optimal. 

Nash equilibrium proof. It is clear that /3* satisfying 4.2 is a Nash equilibrium by 
rewriting from the perspective of an arbitrary agent i as follows: 



$a' (Vi Ui{a') > Ui{a) A 3j Ui{a') > Ui{a)) . 




(4.2) 
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at 



arg max 

ft 



arg max 

Pi 



arg max 



max^P(0)n(0,/3(0)) 



max 



u{9,m) 



= argmaxJ^P(^i)j;P(e^,|0,)^/((0i,e^,),(A(^i),/3;i(0^.)», 

which means that (3* is a best response for Since no special assumptions were made 
on i, it follows that /?* is a Nash equilibrium. 

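Pareto optimality proof. Let us write $V_{\theta_i}(a_i,\beta_{\neq i})$ for the payoff agent $i$ expects for $\theta_i$
when performing $a_i$ when the other agents use policy profile $\beta_{\neq i}$. We have that

$$V_{\theta_i}(a_i,\beta_{\neq i}) = \sum_{\theta_{\neq i}} P(\theta_{\neq i}|\theta_i)\, u(\langle \theta_i,\theta_{\neq i} \rangle, \langle a_i,\beta_{\neq i}(\theta_{\neq i}) \rangle).$$

Now, a joint policy $\beta^*$ satisfying (4.2) is not Pareto optimal if and only if there is another
Nash equilibrium $\beta'$ that attains at least the same payoff for all agents $i$ and for all types $\theta_i$
and strictly more for at least one agent and type. Formally, $\beta^*$ is not Pareto optimal when
$\exists \beta'$ such that:

$$\forall_i \forall_{\theta_i} \; V_{\theta_i}(\beta_i^*(\theta_i),\beta^*_{\neq i}) \leq V_{\theta_i}(\beta_i'(\theta_i),\beta'_{\neq i}) \;\wedge\; \exists_i \exists_{\theta_i} \; V_{\theta_i}(\beta_i^*(\theta_i),\beta^*_{\neq i}) < V_{\theta_i}(\beta_i'(\theta_i),\beta'_{\neq i}). \tag{4.3}$$

We prove that no such $\beta'$ can exist by contradiction. Suppose that $\beta' = \langle \beta_i',\beta'_{\neq i} \rangle$ is a
NE such that (4.3) holds (and thus $\beta^*$ is not Pareto optimal). Because $\beta^*$ satisfies (4.2) we
know that:

$$\sum_{\theta} P(\theta)\, u(\theta,\beta^*(\theta)) \geq \sum_{\theta} P(\theta)\, u(\theta,\beta'(\theta)), \tag{4.4}$$

and therefore, for all agents $i$,

$$\begin{aligned}
& P(\theta_{i,1}) V_{\theta_{i,1}}(\beta_i^*(\theta_{i,1}),\beta^*_{\neq i}) + \dots + P(\theta_{i,|\Theta_i|}) V_{\theta_{i,|\Theta_i|}}(\beta_i^*(\theta_{i,|\Theta_i|}),\beta^*_{\neq i}) \geq \\
& P(\theta_{i,1}) V_{\theta_{i,1}}(\beta_i'(\theta_{i,1}),\beta'_{\neq i}) + \dots + P(\theta_{i,|\Theta_i|}) V_{\theta_{i,|\Theta_i|}}(\beta_i'(\theta_{i,|\Theta_i|}),\beta'_{\neq i})
\end{aligned}$$

holds. However, by assumption that $\beta'$ satisfies (4.3) we get that

$$\exists_j \; V_{\theta_{i,j}}(\beta_i^*(\theta_{i,j}),\beta^*_{\neq i}) < V_{\theta_{i,j}}(\beta_i'(\theta_{i,j}),\beta'_{\neq i}).$$

Therefore it must be that

$$\sum_{k \neq j} P(\theta_{i,k}) V_{\theta_{i,k}}(\beta_i^*(\theta_{i,k}),\beta^*_{\neq i}) > \sum_{k \neq j} P(\theta_{i,k}) V_{\theta_{i,k}}(\beta_i'(\theta_{i,k}),\beta'_{\neq i}),$$

and thus that

$$\exists_k \; V_{\theta_{i,k}}(\beta_i^*(\theta_{i,k}),\beta^*_{\neq i}) > V_{\theta_{i,k}}(\beta_i'(\theta_{i,k}),\beta'_{\neq i}),$$

contradicting the assumption that $\beta'$ satisfies (4.3). □

Figure 5: A Dec-POMDP can be seen as a tree of joint actions and observations. The
indicated planes correspond with the Bayesian games for the first two stages.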
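Theorem 4.1 suggests a direct, if expensive, way to solve a small identical-payoff BG: enumerate every joint policy $\beta$ and keep the one that maximizes the expected payoff in (4.2). The sketch below is only an illustration under stated assumptions. The function name, the data layout (dictionaries keyed by joint types and joint actions) and the two-agent example with its numbers are all hypothetical.

```python
from itertools import product


def solve_identical_payoff_bg(types, actions, prob, payoff):
    """Brute-force (4.2): maximize sum_theta P(theta) * u(theta, beta(theta)).

    types[i]   -- list of types of agent i
    actions[i] -- list of actions of agent i
    prob       -- dict: joint type (tuple) -> probability P(theta)
    payoff     -- dict: (joint type, joint action) -> shared payoff u
    """
    n = len(types)
    # An individual policy beta_i maps every type of agent i to one of its actions.
    individual_policies = [
        [dict(zip(types[i], choice)) for choice in product(actions[i], repeat=len(types[i]))]
        for i in range(n)
    ]
    best_value, best_policy = float("-inf"), None
    for beta in product(*individual_policies):
        value = sum(
            p * payoff[(theta, tuple(beta[i][theta[i]] for i in range(n)))]
            for theta, p in prob.items()
        )
        if value > best_value:
            best_value, best_policy = value, beta
    return best_value, best_policy


# Hypothetical example: correlated types, reward 1 only if both agents pick
# the action that matches agent 1's type ("a" for t1, "b" for t2).
types = [["t1", "t2"], ["u1", "u2"]]
actions = [["a", "b"], ["a", "b"]]
prob = {("t1", "u1"): 0.4, ("t1", "u2"): 0.1, ("t2", "u1"): 0.1, ("t2", "u2"): 0.4}


def target(theta_1):
    return "a" if theta_1 == "t1" else "b"


payoff = {
    (theta, a): 1.0 if a[0] == a[1] == target(theta[0]) else 0.0
    for theta in prob
    for a in product(*actions)
}
best_value, best_policy = solve_identical_payoff_bg(types, actions, prob, payoff)
```

The enumeration visits $\prod_i |\mathcal{A}_i|^{|\Theta_i|}$ joint policies, so it is only usable for very small games; it is meant to make the definition concrete rather than to be efficient.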
4.2 Modeling Dec-POMDPs with Series of Bayesian Games 

Now we will discuss how Bayesian games can be used to model Dec-POMDPs. Essentially, 
a Dec-POMDP can be seen as a tree where nodes are joint action-observation histories 
and edges represent joint actions and observations, as illustrated in Figure 5. At a specific 
stage t in a Dec-POMDP, the main difficulty in coordinating action selection is presented 
by the fact that each agent has its own individual action-observation history. That is, there 
is no global signal that the agents can use to coordinate their actions. This situation can 
be conveniently modeled by a Bayesian game as we will now discuss. 

At a time step $t$, one can directly associate the primitives of a Dec-POMDP with those
of a BG with identical payoffs: the actions of the agents are the same in both cases, and
the types of agent $i$ correspond to its action-observation histories, $\Theta_i \equiv \vec{\Theta}_i^t$. Figure 6 shows
the Bayesian games for $t = 0$ and $t = 1$ for a fictitious Dec-POMDP with 2 agents.

We denote the payoff function of the BG that models a stage of a Dec-POMDP by
$Q(\vec{\theta}^t,a)$. This payoff function should be naturally defined in accordance with the value
function of the planning task. For instance, Emery-Montemerlo et al. (2004) define $Q(\vec{\theta}^t,a)$
as the $Q_{\mathrm{MDP}}$-value of the underlying MDP. We will more extensively discuss the payoff
function in Section 4.3.

The probability $P(\theta)$ is equal to the probability of the joint action-observation history
$\vec{\theta}^t$ to which $\theta$ corresponds and depends on the past joint policy $\varphi^t = (\delta^0,\dots,\delta^{t-1})$ and the
initial state distribution. It can be calculated as the marginal of (2.7):

$$P(\theta) = P(\vec{\theta}^t|\varphi^t,b^0) = \sum_{s^t \in \mathcal{S}} P(s^t,\vec{\theta}^t|\varphi^t,b^0). \tag{4.5}$$

When only considering pure joint policies $\varphi^t$, the action probability component $P_{\varphi}(a|\vec{\theta})$
in (2.7) is 1 for joint action-observation histories $\vec{\theta}^t$ that are 'consistent' with the past joint
policy $\varphi^t$ and 0 otherwise. We say that an action-observation history $\vec{\theta}_i$ is consistent with
a pure policy $\pi_i$ if it can occur when executing $\pi_i$, i.e., when the actions in $\vec{\theta}_i$ would be
selected by $\pi_i$. Let us more formally define this consistency as follows.

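Definition 4.5 (Consistency). Let us write $\vec{\theta}_i^{t'}$ for the restriction of $\vec{\theta}_i^t$ to stages $0,\dots,t'$
(with $0 \leq t' < t$). An action-observation history $\vec{\theta}_i^t$ of agent $i$ is consistent with a pure
policy $\pi_i$ if and only if at each time step $t'$ with $0 \leq t' < t$

$$\pi_i(\vec{\theta}_i^{t'}) \equiv \pi_i(\vec{o}_i^{\,t'}) = a_i^{t'}$$

is the $(t'+1)$-th action in $\vec{\theta}_i^t$. A joint action-observation history $\vec{\theta}^t = \langle \vec{\theta}_1^t,\dots,\vec{\theta}_n^t \rangle$ is
consistent with a pure joint policy $\pi = \langle \pi_1,\dots,\pi_n \rangle$ if each individual $\vec{\theta}_i^t$ is consistent with the
corresponding individual policy $\pi_i$. $C$ is the indicator function for consistency. For instance
$C(\vec{\theta}^t,\pi)$ 'filters out' the action-observation histories $\vec{\theta}^t$ that are inconsistent with a joint
pure policy $\pi$:

$$C(\vec{\theta}^t,\pi) = \begin{cases} 1, & \text{if } \vec{\theta}^t \text{ is consistent with } \pi, \\ 0, & \text{otherwise.} \end{cases} \tag{4.6}$$

We will also write $\vec{\Theta}^t_{\pi} \equiv \{ \vec{\theta}^t \mid C(\vec{\theta}^t,\pi) = 1 \}$ for the set of $\vec{\theta}^t$ consistent with $\pi$.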
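The consistency check of Definition 4.5 is straightforward to state in code. The sketch below assumes a hypothetical encoding: an individual action-observation history is a tuple of (action, observation) pairs, and a pure individual policy is a dictionary from observation histories (tuples of observations) to actions. The function names and the example policy are illustrative only.

```python
def is_consistent(aoh, policy):
    """Individual consistency: every action in `aoh` is the one `policy`
    prescribes for the observation history seen up to that point."""
    obs_history = ()
    for action, observation in aoh:
        if policy[obs_history] != action:
            return False
        obs_history += (observation,)
    return True


def consistency_indicator(joint_aoh, joint_policy):
    """C(theta, pi): 1 if every individual history is consistent, else 0."""
    return int(all(is_consistent(aoh, pi) for aoh, pi in zip(joint_aoh, joint_policy)))


# Hypothetical pure policy for one agent over two stages.
pi_1 = {(): "listen", ("hl",): "listen", ("hr",): "open"}
pi_2 = {(): "listen", ("hl",): "open", ("hr",): "listen"}

theta_ok = ((("listen", "hr"), ("open", "hl")), (("listen", "hl"),))
theta_bad = ((("open", "hr"),), (("listen", "hl"),))

c_ok = consistency_indicator(theta_ok, (pi_1, pi_2))
c_bad = consistency_indicator(theta_bad, (pi_1, pi_2))
```

Here the second joint history is filtered out because the first agent's initial action differs from the one its policy prescribes for the empty observation history.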



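This definition allows us to write

$$P(\vec{\theta}^t|\varphi^t,b^0) = C(\vec{\theta}^t,\varphi^t) \sum_{s^t \in \mathcal{S}} P(s^t,\vec{\theta}^t|b^0) \tag{4.7}$$

with

$$P(s^t,\vec{\theta}^t|b^0) = \sum_{s^{t-1} \in \mathcal{S}} P(o^t|a^{t-1},s^t)\, P(s^t|s^{t-1},a^{t-1})\, P(s^{t-1},\vec{\theta}^{t-1}|b^0). \tag{4.8}$$

Figure 6 illustrates how the indicator function 'filters out' policies: when $\pi^{t=0}(\vec{\theta}^{t=0}) =
\langle a_1,a_2 \rangle$, only the non-shaded part of the BG for $t = 1$ 'can be reached' (has positive
probability).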
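The recursion (4.8) and the filtered probability (4.7) can be sketched as follows. This is an illustrative implementation under assumptions that are not in the original: the model is given as nested dictionaries `T[s][a][s2]` for $P(s'|s,a)$ and `O[a][s2][o]` for $P(o|a,s')$, a joint history is a tuple of (joint action, joint observation) pairs, and the two-state example numbers are made up.

```python
def history_state_prob(history, b0, T, O):
    """Return a dict s -> P(s, history | b0), following the recursion (4.8)."""
    p = dict(b0)  # empty history: P(s, () | b0) = b0(s)
    for a, o in history:
        p = {
            s2: O[a][s2][o] * sum(T[s][a][s2] * p[s] for s in p)
            for s2 in p
        }
    return p


def history_prob(history, b0, T, O, consistent=True):
    """(4.7): zero for histories inconsistent with the past policy,
    otherwise the marginal of (4.8) over states."""
    if not consistent:
        return 0.0
    return sum(history_state_prob(history, b0, T, O).values())


# Hypothetical two-state model with a single joint action "listen".
b0 = {"L": 0.5, "R": 0.5}
T = {
    "L": {"listen": {"L": 1.0, "R": 0.0}},
    "R": {"listen": {"L": 0.0, "R": 1.0}},
}
O = {"listen": {"L": {"hl": 0.85, "hr": 0.15}, "R": {"hl": 0.15, "hr": 0.85}}}

one_step = history_state_prob((("listen", "hl"),), b0, T, O)
two_step_prob = history_prob((("listen", "hl"), ("listen", "hl")), b0, T, O)
```

Normalizing the returned dictionary gives $P(s|\vec{\theta}^t)$, the joint belief associated with the history.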
Figure 6: The Bayesian game for the first and second time step (top: $t = 0$, bottom: $t = 1$).
The entries $\vec{\theta}^t, a^t$ are given by the payoff function $Q(\vec{\theta}^t,a^t)$. Light shaded entries
indicate the solutions. Dark entries will not be realized given $\langle a_1,a_2 \rangle$, the solution
of the BG for $t = 0$.



4.3 The Q-value Function of an Optimal Joint Policy 

Given the perspective of a Dec-POMDP interpreted as a series of BGs as outlined in the
previous section, the solution of the BG for stage $t$ is a joint decision rule $\delta^t$. If the payoff
function for the BG is chosen well, the quality of $\delta^t$ should be high. Emery-Montemerlo
et al. (2004) try to find a good joint policy $\pi = (\delta^0,\dots,\delta^{h-1})$ by a procedure we refer to
as forward-sweep policy computation (FSPC): in one sweep forward through time, the BG
for each stage $t = 0,1,\dots,h-1$ is consecutively solved. As such, the payoff functions for the
BGs constitute what we call a Q-value function for the Dec-POMDP.

Here, we show that there is an optimal Q-value function $Q^*$: when using this $Q^*$ as
the payoff functions for the BGs, forward-sweep policy computation will lead to an optimal
joint policy $\pi^* = (\delta^{0,*},\dots,\delta^{h-1,*})$. We first give a derivation of this $Q^*$. Next, we will
discuss that $Q^*$ can indeed be used to calculate $\pi^*$, but computing $Q^*$ seems impractical
without already knowing an optimal joint policy $\pi^*$. This issue will be further addressed in
Section 4.4.

4.3.1 Existence of Q* 

We now state a theorem identifying a normative description of $Q^*$ as the Q-value function
for an optimal joint policy.

Theorem 4.2. The expected cumulative reward over stages $t,\dots,h-1$ induced by $\pi^*$, an
optimal joint policy for a Dec-POMDP, is given by:

$$V^t(\pi^*) = \sum_{\vec{\theta}^t \in \vec{\Theta}^t_{\pi^*}} P(\vec{\theta}^t|b^0)\, Q^*(\vec{\theta}^t,\pi^*(\vec{\theta}^t)), \tag{4.9}$$

where $\vec{\theta}^t = \langle \vec{o}^{\,t},\vec{a}^{\,t} \rangle$, where $\pi^*(\vec{\theta}^t) = \pi^*(\vec{o}^{\,t})$ denotes the joint action that pure joint policy
$\pi^*$ specifies for $\vec{o}^{\,t}$, and where

$$Q^*(\vec{\theta}^t,a) = R(\vec{\theta}^t,a) + \sum_{o^{t+1} \in \mathcal{O}} P(o^{t+1}|\vec{\theta}^t,a)\, Q^*(\vec{\theta}^{t+1},\pi^*(\vec{\theta}^{t+1})) \tag{4.10}$$

is the Q-value function for $\pi^*$, which gives the expected cumulative future reward when
taking joint action $a$ at $\vec{\theta}^t$ given that an optimal joint policy $\pi^*$ is followed hereafter.

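Proof. By filling out (2.11) for an optimal pure joint policy $\pi^*$, we obtain its expected
cumulative reward as the summation of $E[R(s^t,a^t)|\pi^*]$, the expected rewards it yields for
each time step:

$$V(\pi^*) = \sum_{t=0}^{h-1} E\left[ R(s^t,a^t) \mid \pi^* \right] = \sum_{t=0}^{h-1} \sum_{\vec{\theta}^t \in \vec{\Theta}^t} P(\vec{\theta}^t|\pi^*,b^0)\, R(\vec{\theta}^t,\pi^*(\vec{\theta}^t)). \tag{4.11}$$

In this equation, $P(\vec{\theta}^t|\pi^*,b^0)$ is given by (4.7). As a result, the influence of $\pi^*$ on $P(\vec{\theta}^t|\pi^*,b^0)$
is only through $C$, i.e., $\pi^*$ is only used to 'filter out' inconsistent histories. Therefore we
can write:

$$E\left[ R(s^t,a^t) \mid \pi^* \right] = \sum_{\vec{\theta}^t \in \vec{\Theta}^t_{\pi^*}} P(\vec{\theta}^t|b^0)\, R(\vec{\theta}^t,\pi^*(\vec{\theta}^t)), \tag{4.12}$$

where $P(\vec{\theta}^t|b^0)$ is given by directly taking the marginal of (4.8). Now, let us define the
value starting from time step $t$:

$$V^t(\pi^*) = E\left[ R(s^t,a^t) \mid \pi^* \right] + V^{t+1}(\pi^*) = \Bigl[ \sum_{\vec{\theta}^t \in \vec{\Theta}^t_{\pi^*}} P(\vec{\theta}^t|b^0)\, R(\vec{\theta}^t,\pi^*(\vec{\theta}^t)) \Bigr] + V^{t+1}(\pi^*). \tag{4.13}$$

For the last time step $h-1$ there is no expected future reward, so we get:

$$V^{h-1}(\pi^*) = \sum_{\vec{\theta}^{h-1} \in \vec{\Theta}^{h-1}_{\pi^*}} P(\vec{\theta}^{h-1}|b^0)\, \underbrace{R(\vec{\theta}^{h-1},\pi^*(\vec{\theta}^{h-1}))}_{Q^*(\vec{\theta}^{h-1},\pi^*(\vec{\theta}^{h-1}))}. \tag{4.14}$$

For time step $h-2$ this becomes:

$$V^{h-2}(\pi^*) = \Bigl[ \sum_{\vec{\theta}^{h-2} \in \vec{\Theta}^{h-2}_{\pi^*}} P(\vec{\theta}^{h-2}|b^0)\, R(\vec{\theta}^{h-2},\pi^*(\vec{\theta}^{h-2})) \Bigr] + \sum_{\vec{\theta}^{h-1} \in \vec{\Theta}^{h-1}_{\pi^*}} P(\vec{\theta}^{h-1}|b^0)\, Q^*(\vec{\theta}^{h-1},\pi^*(\vec{\theta}^{h-1})). \tag{4.15}$$

Because $P(\vec{\theta}^{h-1}) = P(\vec{\theta}^{h-2})\, P(o^{h-1}|\vec{\theta}^{h-2},\pi^*(\vec{\theta}^{h-2}))$, (4.15) can be rewritten to:

$$V^{h-2}(\pi^*) = \sum_{\vec{\theta}^{h-2} \in \vec{\Theta}^{h-2}_{\pi^*}} P(\vec{\theta}^{h-2}|b^0)\, Q^*(\vec{\theta}^{h-2},\pi^*(\vec{\theta}^{h-2})), \tag{4.16}$$

with

$$Q^*(\vec{\theta}^{h-2},\pi^*(\vec{\theta}^{h-2})) = R(\vec{\theta}^{h-2},\pi^*(\vec{\theta}^{h-2})) + \sum_{o^{h-1} \in \mathcal{O}} P(o^{h-1}|\vec{\theta}^{h-2},\pi^*(\vec{\theta}^{h-2}))\, Q^*(\vec{\theta}^{h-1},\pi^*(\vec{\theta}^{h-1})). \tag{4.17}$$

Reasoning in the same way we see that (4.9) and (4.10) constitute a generic expression for
the expected cumulative future reward starting from time step $t$. □

Note that in the above derivation, we explicitly included $b^0$ as one of the given arguments.
In the rest of this text, we will always assume $b^0$ is given and therefore omit it, unless
necessary.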

4.3.2 Deriving an Optimal Joint Policy from Q* 

At this point we have derived $Q^*$, a Q-value function for an optimal joint policy. Now, we
extend the results of Emery-Montemerlo et al. (2004) into the exact setting:

Theorem 4.3. Applying forward-sweep policy computation using $Q^*$ as defined by (4.10)
yields an optimal joint policy.

Proof. Note that, per definition, the optimal Dec-POMDP policy $\pi^*$ maximizes the expected
future reward $V^t(\pi^*)$ specified by (4.9). Therefore $\delta^{t,*}$, the optimal decision rule for stage $t$,
is identical to an optimal joint policy $\beta^{t,*}$ for the Bayesian game for time step $t$, if the payoff
function of the BG is given by $Q^*$, that is:

$$\delta^{t,*} \equiv \beta^{t,*} = \arg\max_{\beta^t} \sum_{\vec{\theta}^t \in \vec{\Theta}^t_{\pi^*}} P(\vec{\theta}^t)\, Q^*(\vec{\theta}^t,\beta^t(\vec{\theta}^t)). \tag{4.18}$$

Equation (4.18) tells us that $\delta^{t,*} \equiv \beta^{t,*}$. This means that it is possible to construct the
complete optimal Dec-POMDP policy $\pi^* = (\delta^{0,*},\dots,\delta^{h-1,*})$, by computing $\delta^{t,*}$ for all $t$. □

A subtlety in the calculation of $\pi^*$ is that (4.18) itself is dependent on an optimal joint
policy, as the summation is over all $\vec{\theta}^t \in \vec{\Theta}^t_{\pi^*} \equiv \{ \vec{\theta}^t \mid C(\vec{\theta}^t,\pi^*) = 1 \}$. This is resolved
by realizing that only the past actions influence which action-observation histories can be
reached at time step $t$. Formally, let $\varphi^t = (\delta^0,\dots,\delta^{t-1})$ denote the past joint policy, which
is a partial joint policy $\pi$ specified for stages $0,\dots,t-1$. If we denote the optimal past joint
policy by $\varphi^{t,*}$, we have that $\vec{\Theta}^t_{\pi^*} = \vec{\Theta}^t_{\varphi^{t,*}}$, and therefore that:

$$\beta^{t,*} = \arg\max_{\beta^t} \sum_{\vec{\theta}^t \in \vec{\Theta}^t_{\varphi^{t,*}}} P(\vec{\theta}^t)\, Q^*(\vec{\theta}^t,\beta^t(\vec{\theta}^t)). \tag{4.19}$$

This can be solved in a forward manner for time steps $t = 0,1,2,\dots,h-1$, because at every
time step $\varphi^{t,*} = (\delta^{0,*},\dots,\delta^{t-1,*})$ will be available: it is specified by $(\beta^{0,*},\dots,\beta^{t-1,*})$, the
solutions of the previously solved BGs.
4.3.3 Computing Q*

So far we discussed that $Q^*$ can be used to find an optimal joint policy $\pi^*$. Unfortunately,
when an optimal joint policy $\pi^*$ is not known, computing $Q^*$ itself is impractical, as we
will discuss here. This is in contrast with the (fully observable) single-agent case where the
optimal Q-values can be found relatively easily in a single sweep backward through time.

For MDPs and POMDPs we can compute the Q-values for time step $t$ from those for
$t+1$ by applying a backup operator. This is possible because there is a single agent that
perceives a Markovian signal. This allows the agent to (1) select the optimal action (policy)
for the next time step and (2) determine the expected future reward given the optimal
action (policy) found in step 1. For instance, the backup operator for a POMDP is given
by:

$$Q^*(b^t,a) = R(b^t,a) + \sum_{o} P(o|b^t,a) \max_{a'} Q^*(b^{t+1},a'),$$

which can be rewritten as a 2-step procedure:

1. $\pi^{t+1,*}(b^{t+1}) = \arg\max_{a'} Q^*(b^{t+1},a')$

2. $Q^*(b^t,a) = R(b^t,a) + \sum_{o} P(o|b^t,a)\, Q^*(b^{t+1},\pi^{t+1,*}(b^{t+1}))$.

In the case of Dec-POMDPs, step 2 would correspond to calculating $Q^*$ using (4.10) and
thus depends on $\pi^{t+1,*}$, an optimal joint policy at the next stage. However, step 1, which
calculates $\pi^{t+1,*}$, corresponds to (4.19) and therefore is dependent on $\varphi^{t+1,*}$ (an optimal
joint policy for time steps $0,\dots,t$). So to calculate $Q^{t,*}$, the optimal Q-value function
as specified by (4.10) for stage $t$, an optimal joint policy up to and including stage $t$ is
needed. Effectively, there is a dependence on both the future and the past optimal policy,
rather than only on the future optimal policies as in the single agent case. The only clear
solution seems to be evaluation for all possible past policies, as detailed next. We conjecture
that the problem encountered here is inherent to all decentralized decision making with
imperfect information. For example, we can also observe this in exact point-based dynamic
programming for Dec-POMDPs, as described in Section 3.5, where it is necessary to
generate all (multiagent belief points generated by all) possible past policies.

4.4 Sequential Rationality for Dec-POMDPs 

We conjectured that computing $Q^*$ as introduced in Section 4.3 seems impractical without
knowing $\pi^*$. Here we will relate this to concepts from game theory. In particular, we
discuss a different formulation of Q* based on the principle of sequential rationality, i.e., 
also considering joint action-observation histories that are not realized given an optimal 
joint policy. This formulation of Q* is computable without knowing an optimal joint policy 
in advance, and we present a dynamic programming algorithm to perform this computation. 

4.4.1 Sub-game Perfect and Sequential Equilibria 

The problem we are facing is very much related to the notion of sub-game perfect equilibria
from game theory. A sub-game perfect Nash equilibrium $\pi = \langle \pi_1,\dots,\pi_n \rangle$ has the
characteristic that the contained policies $\pi_i$ specify an optimal action for all possible situations, even
situations that cannot occur when following $\pi$. A commonly given rationale behind this
concept is that, by a mistake of one of the agents during execution, situations that should
not occur according to $\pi$ can occur, and also in these situations the agents should act
optimally. A different rationale is given by Binmore (1992), who remarks that it is "tempting
to shrug one's shoulders at these difficulties [because] rational players will not stray from
the equilibrium path", but that would clearly be a mistake, because the agents "remain
on the equilibrium path because of what they anticipate would happen if they were to
deviate". This implies that agents can decide upon a Nash equilibrium by analyzing what
the expected outcome would be by following other policies, that is, when acting optimally
from other situations. We will perform a similar reasoning here for Dec-POMDPs, which,
in a similar fashion, will result in a description that allows us to deduce an optimal Q-value
function and thus joint policy.

A Dec-POMDP can be modeled as an extensive form game of imperfect information
(Oliehoek & Vlassis, 2006). For such games, the notion of sub-game perfect equilibria is
inadequate; because games of this type often do not contain proper sub-games, every Nash
equilibrium is trivially sub-game perfect.⁴ To overcome this problem different refinements of
the Nash equilibrium concept have been defined, of which we will mention the assessment
equilibrium (Binmore, 1992) and the closely related, but stronger sequential equilibrium
(Osborne & Rubinstein, 1994). Both these equilibria are based on the concept of an
assessment, which is a pair $\langle \pi,b \rangle$ consisting of a joint policy $\pi$ and a belief system $b$. The belief
system maps each possible situation, or information set, of an agent (also the ones that are
not reachable given $\pi$) to a probability distribution over possible joint histories. Roughly
speaking, an assessment equilibrium requires sequential rationality and belief consistency.⁵
The former entails that the joint policy $\pi$ specifies optimal actions for each information
set given $b$. Belief consistency means that all the beliefs that are assigned by $b$ are Bayes-
rational given the specified joint policy $\pi$. For instance, in the context of Dec-POMDPs $b$
would prescribe, for a particular $\vec{\theta}_i^t$ of agent $i$, a belief over joint histories $P(\vec{\theta}^t|\vec{\theta}_i^t)$. If all
beliefs prescribed by belief system $b$ are Bayes-rational (i.e., computed as the appropriate
conditionals of (4.5)), $b$ is called belief consistent.⁶

4.4.2 Sequential Rationality and the Optimal Q-value Function 

The dependence of sequential rationality on a belief system $b$ indicates that the optimal
action at a particular point is dependent on the probability distribution over histories. In
Section 4.3.3 we encountered a similar dependence on the history as specified by $\varphi^{t+1,*}$.
Here we will make this dependence more exact.

At a particular stage $t$, a policy is optimal or, in game-theoretic terms, rational if
it maximizes the expected return from that point on. In Section 4.3.1, we were able to
express this expected return as $Q^*(\vec{\theta}^t,a)$ assuming an optimal joint policy $\pi^*$ is followed up
to the current stage $t$. However, when no such previous policy is assumed, the maximal
expected return is not defined.

4. The extensive form of a Dec-POMDP indeed does not contain proper sub-games, because an agent can
never discriminate between the other agents' observations.

5. Osborne and Rubinstein (1994) refer to this second requirement as simply 'consistency'. In order to
avoid any confusion with Definition 4.5 we will use the term 'belief consistency'.

6. A sequential equilibrium includes a more technical part in the definition of belief consistency that
addresses what beliefs should be held for information sets that are not reached according to $\pi$. For more
information we refer to Osborne and Rubinstein (1994).

Proposition 4.1. For a pair $(\vec{\theta}^t,a^t)$ with $t < h-1$ the optimal value $Q^*(\vec{\theta}^t,a^t)$ cannot be
defined without assuming some (possibly randomized) past policy $\varphi^{t+1} = (\delta^0,\dots,\delta^t)$. Only
for the last stage $t = h-1$ such expected reward is defined as

$$Q^*(\vec{\theta}^{h-1},a^{h-1}) \equiv R(\vec{\theta}^{h-1},a^{h-1})$$

without assuming a past policy.

Proof. Let us try to deduce $Q^*(\vec{\theta}^t,a^t)$, the optimal value for a particular $\vec{\theta}^t$, assuming the
$Q^*$-values for the next time step $t+1$ are known. The $Q^*(\vec{\theta}^t,a^t)$-values for each of the
possible joint actions can be evaluated as follows:

$$\forall_{a^t} \quad Q^*(\vec{\theta}^t,a^t) = R(\vec{\theta}^t,a^t) + \sum_{o^{t+1}} P(o^{t+1}|\vec{\theta}^t,a^t)\, Q^*(\vec{\theta}^{t+1},\delta^{t+1,*}(\vec{\theta}^{t+1})),$$

where $\delta^{t+1,*}$ is an optimal decision rule for the next stage. But what should $\delta^{t+1,*}$ be? If
we assume that up to stage $t+1$ we followed a particular (possibly randomized) $\varphi^{t+1}$,

$$\delta^{t+1,*} = \arg\max_{\beta^{t+1}} \sum_{\vec{\theta}^{t+1} \in \vec{\Theta}^{t+1}} P(\vec{\theta}^{t+1}|\varphi^{t+1},b^0)\, Q^*(\vec{\theta}^{t+1},\beta^{t+1}(\vec{\theta}^{t+1}))$$

is optimal. However, there are many pure and infinitely many randomized past policies $\varphi^{t+1}$ that are
consistent with $\vec{\theta}^t,a^t$, leading to many $\delta^{t+1,*}$ that might be optimal. The conclusion we can
draw is that $Q^*(\vec{\theta}^t,a^t)$ is ill-defined without $P(\vec{\theta}^{t+1}|\varphi^{t+1},b^0)$, the probability distribution
(belief) over joint action-observation histories, which is induced by $\varphi^{t+1}$, the policy followed
for stages $0,\dots,t$. □

Let us illustrate this by reviewing the optimal Q-value function as defined in
Section 4.3.1. Consider $\pi^*(\vec{\theta}^{t+1})$ in (4.10). This optimal policy is a mapping from observation
histories to actions $\pi^* : \vec{\mathcal{O}} \to \mathcal{A}$ induced by the individual policies and observation histories.
This means that for two joint action-observation histories with the same joint observation
history $\pi^*$ results in the same joint action. That is, $\forall_{\vec{a},\vec{o},\vec{a}'} \; \pi^*(\langle \vec{a},\vec{o} \rangle) = \pi^*(\langle \vec{a}',\vec{o} \rangle)$. Effectively
this means that when we reach some $\vec{\theta}^t \notin \vec{\Theta}^t_{\pi^*}$, say through a mistake⁷, $\pi^*$ continues to
specify actions as if no mistake ever happened, that is, still assuming that $\pi^*$ has been
followed up to this stage $t$. In fact, $\pi^*(\vec{\theta}^t)$ might not even be optimal if $\vec{\theta}^t \notin \vec{\Theta}^t_{\pi^*}$, which in
turn means that $Q^*(\vec{\theta}^{t-1},a)$, the Q-values for predecessors of $\vec{\theta}^t$, might not be the optimal
expected reward.

We demonstrated that the optimal Q-value function for a Dec-POMDP is not well-
defined without assuming a past joint policy. We propose a new definition of $Q^*$ that
explicitly incorporates $\varphi^{t+1}$.

7. The question as to how the mistake of one agent should be detected by another agent is a different
matter altogether and beyond the scope of this text.
Figure 7: Computation of sequentially rational $Q^*$. $\delta^{2,*}$ is the optimal decision rule for stage
$t = 2$, given that $\varphi^2$ is followed for the first two stages. $Q^*(\vec{\theta}^1,\varphi^2)$ entries are
computed by propagating relevant $Q^*$-values of the next stage. For instance,
for the highlighted joint history $\vec{\theta}^1 = \langle (a_1,o_1),(a_2,o_2) \rangle$, the $Q^*$-value under $\varphi^2$ is
computed by propagating the values of the four successor joint histories, as per
(4.20).



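Theorem 4.4 (Sequentially rational Q*). The optimal Q-value function is properly defined
as a function of joint action-observation histories and past joint policies, $Q^*(\vec{\theta}^t,\varphi^{t+1})$. This
$Q^*$ specifies the optimal value for all $(\vec{\theta}^t,\varphi^{t+1})$, even for $\vec{\theta}^t$ that are not reached by
execution of an optimal policy $\pi^*$, and therefore is referred to as sequentially rational.

Proof. For all $(\vec{\theta}^t,\varphi^{t+1})$ the optimal expected return is given by

$$Q^*(\vec{\theta}^t,\varphi^{t+1}) = \begin{cases}
R(\vec{\theta}^t,\varphi^{t+1}(\vec{\theta}^t)), & t = h-1, \\
R(\vec{\theta}^t,\varphi^{t+1}(\vec{\theta}^t)) + \sum_{o^{t+1}} P(o^{t+1}|\vec{\theta}^t,\varphi^{t+1}(\vec{\theta}^t))\, Q^*(\vec{\theta}^{t+1},\varphi^{t+2,*}), & 0 \leq t < h-1,
\end{cases} \tag{4.20}$$

where $\varphi^{t+2,*} = \langle \varphi^{t+1},\delta^{t+1,*} \rangle$ and

$$\delta^{t+1,*} = \arg\max_{\beta^{t+1}} \sum_{\vec{\theta}^{t+1} \in \vec{\Theta}^{t+1}} P(\vec{\theta}^{t+1}|\varphi^{t+1},b^0)\, Q^*(\vec{\theta}^{t+1},\langle \varphi^{t+1},\beta^{t+1} \rangle), \tag{4.21}$$

which is well-defined. □

The above equations constitute a dynamic program. When assuming that only pure
joint past policies $\varphi$ can be used, (4.21) transforms to

$$\delta^{t+1,*} = \arg\max_{\beta^{t+1}} \sum_{\vec{\theta}^{t+1} \in \vec{\Theta}^{t+1}_{\varphi^{t+1}}} P(\vec{\theta}^{t+1})\, Q^*(\vec{\theta}^{t+1},\langle \varphi^{t+1},\beta^{t+1} \rangle) \tag{4.22}$$

and for all $(\vec{\theta}^t,\varphi^{t+1})$ such that $\vec{\theta}^t$ is consistent with $\varphi^{t+1}$ the dynamic program can be evaluated
from the end ($t = h-1$) to the beginning ($t = 0$). Figure 7 illustrates the computation of $Q^*$.
When arriving at stage 0, the $\varphi^1$ reduce to joint actions and it is possible to select

$$\delta^{0,*} = \arg\max_{a} Q^*(\vec{\theta}^{t=0},a) = \arg\max_{\varphi^1} Q^*(\vec{\theta}^{t=0},\varphi^1).$$

Then given $\varphi^1 = \delta^{0,*}$ we can determine $\delta^{1,*}$ using (4.22), etc. This essentially is
the forward-sweep policy computation using the optimal Q-value function as
defined by (4.20).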

The computation of $Q^*$ is also closely related to point-based dynamic programming for
Dec-POMDPs as discussed in Section 3.5. Suppose that $t = 2$ in Figure 7 is the last stage
(i.e., $h = 3$). When the $\delta^{2,*}$ for all $\varphi^2$ have been computed, it is easy to construct the sets of
non-dominated actions for each agent: every action $a_i$ of agent $i$ that is specified by some $\delta^{2,*}$
is non-dominated. Once we have computed the values for all $(\vec{\theta}^1,\varphi^2)$ at $t = 1$, each $\varphi^2$ has
an associated optimal future policy $\delta^{2,*}$. This means that each individual history $\vec{\theta}_i^1$ has an
associated sub-tree policy $q_i^{\tau=2}$ for each $\varphi^2$ and as such each $(\vec{\theta}^1,\varphi^2)$-pair has an associated
joint sub-tree policy (e.g., the shaded trees in Figure 7). Clearly, $Q^*(\vec{\theta}^1,\varphi^2)$ corresponds to the
expected value of this associated joint sub-tree policy. Rather than keeping track of these
sub-tree policies, however, the algorithm presented here keeps track of the values.

The advantage of the description of $Q^*$ using (4.21) rather than (4.10) is twofold. First,
the description treated here describes the way to actually compute the values, which can
then be used to construct $\pi^*$, while the latter only gives a normative description and needs
$\pi^*$ in order to compute the Q-values.

Second, this $Q^*(\vec{\theta}^t,\varphi^{t+1})$ describes sequential rationality for Dec-POMDPs. For any
past policy (and corresponding consistent belief system) the optimal future policy can be
computed. A variation of this might even be applied on-line. Suppose agent $i$ makes a
mistake at stage $t$, executing an action not prescribed by $\pi_i^*$. Assuming the other agents
execute their policy $\pi^*_{\neq i}$ without mistakes, agent $i$ knows the actually executed previous
policy $\varphi^{t+1}$. Therefore it can compute a new individual policy by
= argmax Yl (^*^^ )• 

4.4.3 The Complexity of Computing a Sequentially Rational Q* 

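Although we have now found a way to compute $Q^*$, this computation is intractable for all
but the smallest problems, as we will now show. At stage $t-1$ there are $\sum_{t'=0}^{t-1} |\mathcal{O}_i|^{t'} = \frac{|\mathcal{O}_i|^t - 1}{|\mathcal{O}_i| - 1}$
observation histories for agent $i$, leading to

$$O\left( |\mathcal{A}_*|^{\frac{n(|\mathcal{O}_*|^t - 1)}{|\mathcal{O}_*| - 1}} \right)$$

pure joint past policies $\varphi^t$ (with $|\mathcal{A}_*|$ and $|\mathcal{O}_*|$ the sizes of the largest individual action and
observation sets). For each of these there are $|\vec{\Theta}^{t-1}_{\varphi^t}| = |\mathcal{O}|^{t-1}$ consistent joint action-
observation histories (for each observation history $\vec{o}^{\,t-1}$, $\varphi^t$ specifies the actions forming
$\vec{\theta}^{t-1}$). This means that for stage $h-2$ (for $h-1$, the Q-values are easily calculated), the
number of entries to be computed is the number of joint past policies $\varphi^{h-1}$ times the number
of joint histories

$$O\left( |\mathcal{A}_*|^{\frac{n(|\mathcal{O}_*|^{h-1} - 1)}{|\mathcal{O}_*| - 1}} \cdot |\mathcal{O}_*|^{n(h-2)} \right),$$

indicating that computation of this function is doubly exponential, just as brute force policy
evaluation. Also, for each joint past policy $\varphi^{h-1}$, we need to compute $\varphi^{h,*} = \langle \varphi^{h-1},\delta^{h-1,*} \rangle$
by solving the next-stage BG:

$$\delta^{h-1,*} = \arg\max_{\beta^{h-1}} \sum_{\vec{\theta}^{h-1} \in \vec{\Theta}^{h-1}_{\varphi^{h-1}}} P(\vec{\theta}^{h-1})\, Q^*(\vec{\theta}^{h-1},\langle \varphi^{h-1},\beta^{h-1} \rangle).$$

To the authors' knowledge, the only method to optimally solve these BGs is evaluation of
all

$$O\left( |\mathcal{A}_*|^{n |\mathcal{O}_*|^{h-1}} \right)$$

joint BG-policies, which is also doubly exponential in the horizon.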
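To get a feeling for how quickly these quantities grow, the counts can be evaluated for a small problem. The sketch below is an illustration only and makes a simplifying assumption that is not in the text: all $n$ agents have the same number of actions and observations, so the bounds become exact counts. The function names are hypothetical.

```python
def num_obs_histories(num_obs, t):
    """Individual observation histories for stages 0..t-1: sum_{t'=0}^{t-1} |O_i|^t'."""
    return sum(num_obs ** k for k in range(t))


def num_pure_joint_past_policies(n, num_actions, num_obs, t):
    """Pure joint past policies phi^t: one action per observation history, per agent."""
    return num_actions ** (n * num_obs_histories(num_obs, t))


def num_q_entries(n, num_actions, num_obs, h):
    """Entries Q*(theta^{h-2}, phi^{h-1}): past policies times consistent joint histories."""
    joint_histories = (num_obs ** n) ** (h - 2)
    return num_pure_joint_past_policies(n, num_actions, num_obs, h - 1) * joint_histories


def num_joint_bg_policies(n, num_actions, num_obs, h):
    """Joint BG-policies for the last stage: |A_i|^(n * |O_i|^(h-1))."""
    return num_actions ** (n * num_obs ** (h - 1))


# Two agents, two actions and two observations each.
entries_by_horizon = {h: num_q_entries(2, 2, 2, h) for h in (3, 4, 5)}
```

For this toy setting the number of entries at stage $h-2$ is 256 for $h = 3$, 262144 for $h = 4$ and 68719476736 for $h = 5$, which illustrates the doubly exponential growth in the horizon.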
5. Approximate Q-value Functions

As indicated in the previous section, although an optimal Q-value function $Q^*$ exists, it
is costly to compute and thus impractical. In this section, we review some other Q-value
functions, $\hat{Q}$, that can be used as an approximation for $Q^*$. We will discuss underlying
assumptions, computation, computational complexity and other properties, thereby
providing a taxonomy of approximate Q-value functions for Dec-POMDPs. In particular we
will treat two well-known approximate Q-value functions, $Q_{\mathrm{MDP}}$ and $Q_{\mathrm{POMDP}}$, and $Q_{\mathrm{BG}}$,
recently introduced by Oliehoek and Vlassis (2007).

5.1 QMDP

$Q_{\mathrm{MDP}}$ was originally proposed to approximately solve POMDPs by Littman, Cassandra,
and Kaelbling (1995), but has also been applied to Dec-POMDPs (Emery-Montemerlo
et al., 2004; Szer et al., 2005). The idea is that $Q^*$ can be approximated using the state-
action values $Q_M(s,a)$ found when solving the 'underlying MDP' of a Dec-POMDP. This
'underlying MDP' is the horizon-$h$ MDP defined by a single agent that takes joint actions
$a \in \mathcal{A}$ and observes the nominal state $s$ that has the same transition model $T$ and reward
model $R$ as the original Dec-POMDP. Solving this underlying MDP can be efficiently done
using dynamic programming techniques (Puterman, 1994), resulting in the optimal non-
stationary MDP Q-value function:

$$Q_M^{t,*}(s^t,a) = R(s^t,a) + \sum_{s^{t+1} \in \mathcal{S}} P(s^{t+1}|s^t,a) \max_{a'} Q_M^{t+1,*}(s^{t+1},a'). \tag{5.1}$$

In this equation, the maximization is an implicit selection of $\pi_M^{t+1,*}$, the optimal MDP policy
at the next time step, as explained in Section 4.3.3. Note that $Q_M^{t,*}$ is also an optimal Q-
value function, but in the MDP setting. In this article $Q^*$ will always denote the optimal
value function for the (original) Dec-POMDP. In order to transform the $Q_M^{t,*}(s^t,a)$-values
to approximate $Q_M(\vec{\theta}^t,a)$-values to be used for the original Dec-POMDP, we compute:
$$Q_M(\vec{\theta}^t,a) = \sum_{s \in \mathcal{S}} Q_M^{t,*}(s,a)\, P(s|\vec{\theta}^t), \tag{5.2}$$

where $P(s|\vec{\theta}^t)$ can be computed from (4.8). Combining (5.1) and (5.2) and making the
selection of $\pi_M^{t+1,*}$ explicit we get:

$$Q_M(\vec{\theta}^t,a) = R(\vec{\theta}^t,a) + \sum_{s^{t+1} \in \mathcal{S}} P(s^{t+1}|\vec{\theta}^t,a) \max_{\pi_M^{t+1}(s^{t+1})} Q_M^{t+1,*}(s^{t+1},\pi_M^{t+1}(s^{t+1})), \tag{5.3}$$

which defines the approximate Q-value function that can be used as payoff function for the
various BGs of the Dec-POMDP. Note that $Q_M$ is consistent with the established definition
of Q-value functions since it is defined as the expected immediate reward of performing
(joint) action $a$ plus the value of following an optimal joint policy (in this case the optimal
MDP-policy) thereafter.
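The two ingredients, backward induction on the underlying MDP as in (5.1) and the belief-weighted average of (5.2), can be sketched as follows. This is a minimal illustration, not the authors' implementation. The data layout (`T[s][a][s2]` for $P(s'|s,a)$, `R[s][a]` for the reward), the function names and the two-state example are assumptions made for the sake of the example.

```python
def mdp_q_values(states, actions, T, R, h):
    """Backward induction (5.1). Returns a list Q with Q[t][s][a] for t = 0..h-1."""
    Q = [None] * h
    for t in reversed(range(h)):
        Q[t] = {}
        for s in states:
            Q[t][s] = {}
            for a in actions:
                future = 0.0
                if t + 1 < h:
                    # Implicit selection of the optimal MDP action at the next stage.
                    future = sum(
                        T[s][a][s2] * max(Q[t + 1][s2].values()) for s2 in states
                    )
                Q[t][s][a] = R[s][a] + future
    return Q


def q_mdp(Q_t, belief, a):
    """(5.2): weight the stage-t MDP values by P(s | theta^t)."""
    return sum(belief[s] * Q_t[s][a] for s in belief)


# Hypothetical two-state problem: "stay" keeps the state, "go" switches it.
states = ["s0", "s1"]
actions = ["stay", "go"]
T = {
    "s0": {"stay": {"s0": 1.0, "s1": 0.0}, "go": {"s0": 0.0, "s1": 1.0}},
    "s1": {"stay": {"s0": 0.0, "s1": 1.0}, "go": {"s0": 1.0, "s1": 0.0}},
}
R = {"s0": {"stay": 0.0, "go": 1.0}, "s1": {"stay": 2.0, "go": 0.0}}

Q = mdp_q_values(states, actions, T, R, h=2)
belief = {"s0": 0.5, "s1": 0.5}  # P(s | theta) for some joint history at t = 0
q_stay = q_mdp(Q[0], belief, "stay")
q_go = q_mdp(Q[0], belief, "go")
```

Because the state-action values are computed once and reused for every history, evaluating a single $Q_M(\vec{\theta}^t,a)$ only requires one pass over the states.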

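Because calculation of the $Q_M^{t,*}(s,a)$-values by dynamic programming (which has a cost
of $O(|\mathcal{S}| \times h)$) can be performed in a separate phase, the cost of computation of $Q_{\mathrm{MDP}}$ is only
dependent on the cost of evaluation of (5.3), which is $O(|\mathcal{S}|)$. When we want to evaluate
$Q_{\mathrm{MDP}}$ for all $\sum_{t=0}^{h-1} (|\mathcal{A}||\mathcal{O}|)^t = \frac{(|\mathcal{A}||\mathcal{O}|)^h - 1}{(|\mathcal{A}||\mathcal{O}|) - 1}$ joint action-observation histories, the total
computational cost becomes:

$$O\left( \frac{(|\mathcal{A}||\mathcal{O}|)^h - 1}{(|\mathcal{A}||\mathcal{O}|) - 1}\, |\mathcal{A}||\mathcal{S}| \right). \tag{5.4}$$

However, when applying $Q_{\mathrm{MDP}}$ in forward-sweep policy computation, we do not have to
consider all action-observation histories, but only those that are consistent with the policy
found for earlier stages. Effectively we only have to evaluate (5.3) for all observation histories
and joint actions, leading to:

$$O\left( \frac{|\mathcal{O}|^h - 1}{|\mathcal{O}| - 1}\, |\mathcal{A}||\mathcal{S}| \right). \tag{5.5}$$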

When used in the context of POMDPs, $Q_{\mathrm{MDP}}$ solutions are known to undervalue
actions that gain information (Fernandez, Sanz, Simmons, & Dieguez, 2006). This is
explained by realizing that the $Q_{\mathrm{MDP}}$ solution assumes that the state will be fully observable
in the next time step. Therefore actions that provide information about the state, and
thus can lead to a high future reward (but might have a low immediate reward), will be
undervalued. When applying $Q_{\mathrm{MDP}}$ in the Dec-POMDP setting, this effect can also be
expected. Another consequence of the simplifying assumption is that the $Q_{\mathrm{MDP}}$-value
function is an upper bound to the optimal value function when used to approximate a POMDP
(Hauskrecht, 2000); as a consequence it is also an upper bound to the optimal value function
of a Dec-POMDP. This is intuitively clear, as a Dec-POMDP is a POMDP but with the
additional difficulty of decentralization. A formal argument will be presented in Section 5.4.

5.2 QPOMDP

Similar to the 'underlying MDP', one can define the 'underlying POMDP' of a Dec-POMDP
as the POMDP with the same $T$, $O$ and $R$, but in which there is only a single agent that
takes joint actions $a \in \mathcal{A}$ and receives joint observations $o \in \mathcal{O}$. $Q_{\mathrm{POMDP}}$ approximates $Q^*$
using the solution of the underlying POMDP (Szer et al., 2005; Roth et al., 2005).

In particular, the optimal $Q_{\mathrm{POMDP}}$ value function for an underlying POMDP satisfies:

$$Q_P^*(b^{\vec{\theta}^t},a) = R(b^{\vec{\theta}^t},a) + \sum_{o^{t+1} \in \mathcal{O}} P(o^{t+1}|b^{\vec{\theta}^t},a) \max_{\pi_P^{t+1}(b^{\vec{\theta}^{t+1}})} Q_P^*(b^{\vec{\theta}^{t+1}},\pi_P^{t+1}(b^{\vec{\theta}^{t+1}})), \tag{5.6}$$

where $b^{\vec{\theta}^t}$ is the joint belief of the single agent that selects joint actions and receives joint
observations at time step $t$, where

$$R(b^{\vec{\theta}^t},a) = \sum_{s \in \mathcal{S}} R(s,a)\, b^{\vec{\theta}^t}(s) \tag{5.7}$$

is the immediate reward, and where $b^{\vec{\theta}^{t+1}}$ is the joint belief resulting from $b^{\vec{\theta}^t}$ by action $a$
and joint observation $o^{t+1}$, calculated by

$$b^{\vec{\theta}^{t+1}}(s') = \frac{P(o^{t+1}|a,s') \sum_{s \in \mathcal{S}} P(s'|s,a)\, b^{\vec{\theta}^t}(s)}{\sum_{s' \in \mathcal{S}} P(o^{t+1}|a,s') \sum_{s \in \mathcal{S}} P(s'|s,a)\, b^{\vec{\theta}^t}(s)}. \tag{5.8}$$

For each $\vec{\theta}^t$ there is one joint belief $b^{\vec{\theta}^t}$, which corresponds to $P(s|\vec{\theta}^t)$ as can be derived
from (4.8). Therefore it is possible to directly use the computed $Q_{\mathrm{POMDP}}$ values as payoffs
for the BGs of the Dec-POMDP, that is, we define:

$$Q_P(\vec{\theta}^t,a) \equiv Q_P^*(b^{\vec{\theta}^t},a). \tag{5.9}$$

The maximization in (5.6) is stated in its explicit form: a maximization over time step
$t+1$ POMDP policies. However, it should be clear that this maximization effectively is one
over joint actions, as it is conditional on the received joint observation $o^{t+1}$ and thus the
resulting belief $b^{\vec{\theta}^{t+1}}$.

For a finite horizon, $Q_P$ can be computed by generating all possible joint beliefs and
solving the 'belief MDP'. Generating all possible beliefs is easy: starting with $b^0$,
corresponding to the empty joint action-observation history $\vec{\theta}^{t=0}$, for each $a$ and $o$ we calculate
the resulting $\vec{\theta}^{t=1}$ and corresponding belief $b^{\vec{\theta}^{t=1}}$ and continue recursively. Solving the belief
MDP amounts to recursively applying (5.6).
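The recursive procedure just described can be sketched compactly: a Bayesian belief update, the expected immediate reward (5.7) and the backup (5.6), applied recursively until the last stage. The code below is a hedged illustration rather than the authors' implementation. The model layout (`T[s][a][s2]`, `O[a][s2][o]`, `R[s][a]`), the function names and the two-state example with uninformative observations are assumptions.

```python
def belief_update(b, a, o, T, O):
    """Return (P(o | b, a), next belief) using Bayes' rule."""
    unnorm = {s2: O[a][s2][o] * sum(T[s][a][s2] * b[s] for s in b) for s2 in b}
    p_o = sum(unnorm.values())
    if p_o == 0.0:
        return 0.0, None
    return p_o, {s2: v / p_o for s2, v in unnorm.items()}


def q_pomdp(b, a, t, h, actions, observations, T, O, R):
    """(5.6): expected immediate reward plus the expected maximum over next joint actions."""
    value = sum(R[s][a] * b[s] for s in b)  # immediate reward, as in (5.7)
    if t == h - 1:
        return value
    for o in observations:
        p_o, b_next = belief_update(b, a, o, T, O)
        if p_o > 0.0:
            value += p_o * max(
                q_pomdp(b_next, a2, t + 1, h, actions, observations, T, O, R)
                for a2 in actions
            )
    return value


# Hypothetical model: "stay" keeps the state, "go" switches it,
# and the observations carry no information about the state.
actions = ["stay", "go"]
observations = ["x", "y"]
T = {
    "s0": {"stay": {"s0": 1.0, "s1": 0.0}, "go": {"s0": 0.0, "s1": 1.0}},
    "s1": {"stay": {"s0": 0.0, "s1": 1.0}, "go": {"s0": 1.0, "s1": 0.0}},
}
O = {a: {s: {"x": 0.5, "y": 0.5} for s in ("s0", "s1")} for a in actions}
R = {"s0": {"stay": 0.0, "go": 1.0}, "s1": {"stay": 2.0, "go": 0.0}}
b0 = {"s0": 0.5, "s1": 0.5}

q_stay = q_pomdp(b0, "stay", 0, 2, actions, observations, T, O, R)
q_go = q_pomdp(b0, "go", 0, 2, actions, observations, T, O, R)
```

The recursion visits every joint action-observation history reachable from the initial belief, which is why the values cannot be restricted to the histories selected by a forward sweep.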

In the computation of $Q_{\mathrm{MDP}}$ we could restrict our attention to only those $(\vec{\theta}^t,a)$-pairs
that were specified by forward-sweep policy computation, because the $Q_M(\vec{\theta}^t,a)$-values do
not depend on the values of successor-histories $Q_M(\vec{\theta}^{t+1},a)$. For $Q_{\mathrm{POMDP}}$, however, there
is such a dependence, meaning that it is necessary to evaluate $Q_P(\vec{\theta}^t,a)$ for all $\vec{\theta}^t,a$. In particular,
the cost of calculating $Q_{\mathrm{POMDP}}$ can be divided in the cost of calculating the expected
immediate reward for all $\vec{\theta}^t,a$, and the cost of evaluating future reward for all $\vec{\theta}^t,a$ with
$t = 0,\dots,h-2$. The former operation is given by (5.7) and has cost $O(|\mathcal{S}|)$ per $\vec{\theta}^t,a$ and
thus a total cost equal to (5.4). The latter requires selecting the maximizing joint action
for each joint observation for all $\vec{\theta}^t,a$ with $t = 0,\dots,h-2$, leading to

$$O\left( \frac{(|\mathcal{A}||\mathcal{O}|)^{h-1} - 1}{(|\mathcal{A}||\mathcal{O}|) - 1}\, |\mathcal{A}| \, (|\mathcal{A}||\mathcal{O}|) \right). \tag{5.10}$$
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Figure 8: Backward calculation of Qpomdp"'^^1u^s. Note that the solutions (the highlighted 
entries) are different from those in Figure 6: Qpomdp assumes that the actions 
can be conditioned on the joint action-observation history. The highlighted '+3.1' 
entry for the Bayesian game for i = is calculated as the expected immediate 
reward (= 0) plus a weighted sum of the maximizing entry (joint action) per 
next joint observation history. When assuming a uniform distribution over joint 
observations given (01,02) the future reward is given by: +3.1 = + 0.25 x 2.0 + 
0.25 X 4.0 + 0.25 x 4.4 + 0.25 x 2.0. 



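Therefore the total complexity of computing $Q_{\mathrm{POMDP}}$ becomes

\[ O\left( \frac{(|\mathcal{A}||\mathcal{O}|)^{h-1}-1}{(|\mathcal{A}||\mathcal{O}|)-1} \, |\mathcal{A}| \, (|\mathcal{A}||\mathcal{O}|) + \frac{(|\mathcal{A}||\mathcal{O}|)^{h}-1}{(|\mathcal{A}||\mathcal{O}|)-1} \, |\mathcal{A}||\mathcal{S}| \right). \tag{5.11} \]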

Evaluating (5.6) for all joint action-observation histories $\vec\theta^t \in \vec\Theta^t$ can
be done in a single backward sweep through time, as we mentioned in Section 4.3.3. This can also
be visualized in Bayesian games as illustrated in Figure 8; the expected future reward is
calculated as a maximizing weighted sum of the entries of the next time step BG.

Nevertheless, solving a POMDP optimally is also known to be an intractable problem.
As a result, POMDP research in the last decade has focused on approximate solutions for
POMDPs. In particular, it is known that the value function of a POMDP is piecewise-linear
and convex (PWLC) over the (joint) belief space (Sondik, 1971). This property is exploited
by many approximate POMDP solution methods (Pineau, Gordon, & Thrun, 2003; Spaan
& Vlassis, 2005). Clearly such methods can also be used to calculate an approximate
$Q_{\mathrm{POMDP}}$-value function for use with Dec-POMDPs.

It is intuitively clear that $Q_{\mathrm{POMDP}}$ is an admissible heuristic for Dec-POMDPs, as
it still assumes that more information is available than actually is the case (again a formal
proof will be given in Section 5.4). Also it should be clear that, as fewer assumptions are
made, $Q_{\mathrm{POMDP}}$ should yield less of an over-estimation than $Q_{\mathrm{MDP}}$. I.e.,
the $Q_{\mathrm{POMDP}}$-values should lie between the $Q_{\mathrm{MDP}}$ and optimal
$Q^*$-values.

In contrast to $Q_{\mathrm{MDP}}$, $Q_{\mathrm{POMDP}}$ does not assume full observability of
nominal states. As a result the latter does not share the drawback of undervaluing actions that
will gain information regarding the nominal state. When applied in a Dec-POMDP setting, however,
$Q_{\mathrm{POMDP}}$ does share the assumption of centralized control. This assumption might also
cause a relative undervaluation: there might be situations where some action might gain
information regarding the joint (i.e., each other's) observation history. Under
$Q_{\mathrm{POMDP}}$ this will be considered redundant, while in decentralized execution this
might be very beneficial, as it allows for better coordination.

5.3 $Q_{\mathrm{BG}}$

$Q_{\mathrm{MDP}}$ approximates $Q^*$ by assuming that the state becomes fully observable in the
next time step, while $Q_{\mathrm{POMDP}}$ assumes that at every time step $t$ the agents know the
joint action-observation history $\vec\theta^t$. Here we present a new approximate Q-value
function, called $Q_{\mathrm{BG}}$, that relaxes the assumptions further: it assumes that the
agents know the joint action-observation history up to time step $t-1$, and the joint action
$a^{t-1}$ that was taken at the previous time step. This means that the agents are uncertain
regarding each other's last observation, which effectively defines a BG for each
$\vec\theta^{t-1},a$. Note that these BGs are different from the BGs used in Section 4.2: the BGs
here have types that correspond to single observations, whereas the BGs in 4.2 have types that
correspond to complete action-observation histories. Hence, the BGs of $Q_{\mathrm{BG}}$ are much
smaller in size and thus easier to solve. Formally $Q_{\mathrm{BG}}$ is defined as:

\[ Q_B^*(\vec\theta^t,a) = R(\vec\theta^t,a) + \max_{\beta} \sum_{o^{t+1}} P(o^{t+1}|\vec\theta^t,a)\, Q_B^*(\vec\theta^{t+1},\beta(o^{t+1})), \tag{5.12} \]

where $\beta = \langle \beta_1(o_1^{t+1}),\dots,\beta_n(o_n^{t+1}) \rangle$ is a tuple of
individual policies $\beta_i : \mathcal{O}_i \to \mathcal{A}_i$ for the BG constructed for
$\vec\theta^t,a$.

Note that the only difference between (5.12) and (5.6) is the position and argument 
of the maximization operator: (5.12) maximizes over a (conditional) BG-policy, while the 
maximization in (5.6) is effectively over unconditional joint actions. 

The BG representation of the fictitious Dec-POMDP in Figure 6 illustrates the computation of
$Q_{\mathrm{BG}}$ (see footnote 8). The probability distribution
$P(\vec\theta^{t=1}|\langle a_1,a_2 \rangle)$ over joint action-observation histories that can be
reached given $\langle a_1,a_2 \rangle$ at $t = 0$ is uniform and the immediate reward for
$\langle a_1,a_2 \rangle$ is 0. Therefore, we have that
$2.75 = 0.25 \cdot 2.0 + 0.25 \cdot 3.6 + 0.25 \cdot 4.4 + 0.25 \cdot 1.0$.
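The difference between the two backups can be checked numerically. The sketch below uses a hypothetical next-stage payoff table for two agents with two observations and two actions each. Only the entries that appear in the two sums quoted in the text (2.0, 4.0, 4.4, 2.0 for $Q_{\mathrm{POMDP}}$ and 2.0, 3.6, 4.4, 1.0 for $Q_{\mathrm{BG}}$) are taken from the paper; their placement in the table and all remaining entries (set to 0) are assumptions made for illustration.

```python
from itertools import product

OBS = [0, 1]   # individual observations of each agent (hypothetical labels)
ACT = [0, 1]   # individual actions of each agent (hypothetical labels)
P_O = 0.25     # uniform probability of each joint observation

# Q_next[(o1, o2)][(a1, a2)]: next-stage value; entries not listed are 0.
Q_next = {
    (0, 0): {(0, 0): 2.0},
    (0, 1): {(0, 1): 3.6, (1, 1): 4.0},
    (1, 0): {(1, 0): 4.4},
    (1, 1): {(1, 1): 1.0, (0, 0): 2.0},
}


def q(o, a):
    return Q_next[o].get(a, 0.0)


def pomdp_backup():
    """Future-reward term of (5.6): best joint action per joint observation."""
    return sum(P_O * max(q(o, a) for a in product(ACT, ACT))
               for o in product(OBS, OBS))


def bg_backup():
    """Future-reward term of (5.12): best pair of mappings o_i -> a_i."""
    best = float("-inf")
    for b1 in product(ACT, repeat=len(OBS)):      # b1[o1]: action of agent 1
        for b2 in product(ACT, repeat=len(OBS)):  # b2[o2]: action of agent 2
            value = sum(P_O * q((o1, o2), (b1[o1], b2[o2]))
                        for o1, o2 in product(OBS, OBS))
            best = max(best, value)
    return best
```

In this table the joint-action maximum for two of the four joint observations requires a combination of individual choices that no single pair of mappings can realize everywhere, which is why the BG backup comes out lower than the per-observation maximization.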

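The cost of computing $Q_{\mathrm{BG}}$ for all $\vec\theta^t,a$ can be split up in the cost of
computing the immediate reward (see (5.4)) and the cost of computing the future reward (solving a
BG over the last received observation), which is

\[ O\left( \frac{(|\mathcal{A}||\mathcal{O}|)^{h-1}-1}{(|\mathcal{A}||\mathcal{O}|)-1} \, |\mathcal{A}| \, |\mathcal{A}_*|^{n|\mathcal{O}_*|} \right), \]

leading to a total cost of:

\[ O\left( \frac{(|\mathcal{A}||\mathcal{O}|)^{h-1}-1}{(|\mathcal{A}||\mathcal{O}|)-1} \, |\mathcal{A}| \, |\mathcal{A}_*|^{n|\mathcal{O}_*|} + \frac{(|\mathcal{A}||\mathcal{O}|)^{h}-1}{(|\mathcal{A}||\mathcal{O}|)-1} \, |\mathcal{A}||\mathcal{S}| \right). \tag{5.13} \]

Comparing to the cost of computing $Q_{\mathrm{POMDP}}$, this contains an additional exponential
term, but this term does not depend on the horizon of the problem.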
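To get a feeling for these operation counts, the following sketch tallies them for given problem sizes. It assumes that each BG is solved by brute-force enumeration of all joint BG-policies (at most $|\mathcal{A}_*|^{n|\mathcal{O}_*|}$ of them, with $\mathcal{A}_*$ and $\mathcal{O}_*$ the largest individual action and observation sets); the function names and the example sizes are our own and the numbers are counts, not timings.

```python
def num_histories(n_joint_actions, n_joint_obs, stages):
    """Number of joint action-observation histories for t = 0..stages-1."""
    branching = n_joint_actions * n_joint_obs
    return sum(branching ** t for t in range(stages))


def cost_immediate(n_a, n_o, n_s, h):
    """Immediate reward: |S| operations per (history, joint action) pair."""
    return num_histories(n_a, n_o, h) * n_a * n_s


def cost_future_pomdp(n_a, n_o, h):
    """One maximization over joint actions per joint observation."""
    return num_histories(n_a, n_o, h - 1) * n_a * (n_a * n_o)


def cost_future_bg(n_a, n_o, h, n_agents, max_ind_actions, max_ind_obs):
    """One brute-force BG solution per (history, joint action) pair."""
    bg_policies = max_ind_actions ** (n_agents * max_ind_obs)
    return num_histories(n_a, n_o, h - 1) * n_a * bg_policies
```

For a problem with two agents, three individual actions and two individual observations (9 joint actions, 4 joint observations), the ratio `cost_future_bg(9, 4, h, 2, 3, 2) / cost_future_pomdp(9, 4, h)` is the same for every horizon `h`, which is the sense in which the extra exponential term does not depend on the horizon.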



8. Because the BG representing $t = 1$ of a Dec-POMDP also involves observation histories of
length 1, the illustration of such a BG corresponds to the BGs as considered in
$Q_{\mathrm{BG}}$. For other stages this is not the case.

As mentioned in Section 5.2, $Q_{\mathrm{POMDP}}$ can be approximated by exploiting the
PWLC-property of the value function. It turns out that the $Q_{\mathrm{BG}}$-value function
corresponds to an optimal value function for the situation where the agents can communicate
freely with a one-step delay (Oliehoek et al., 2007b). Hsu and Marcus (1982) showed how a complex
dynamic program can be constructed for such settings and that the resulting value function also
preserves the PWLC property. Not surprisingly, the $Q_{\mathrm{BG}}$-value function also is
piecewise-linear and convex over the joint belief space and, as a result, approximation methods
for POMDPs can be transferred to the computation of $Q_{\mathrm{BG}}$ (Oliehoek et al., 2007b).

5.4 Generalized $Q_{\mathrm{BG}}$ Bounds

We can think of an extension of the $Q_{\mathrm{BG}}$-value function framework to the case of
$k$-steps delayed communication, where each agent perceives the joint action-observation history
with $k$ stages delay. That is, at stage $t$, each agent $i$ knows $\vec\theta^{t-k}$, the joint
action-observation history of $k$ stages before, in addition to its own current
action-observation history $\vec\theta_i^t$. Similar $k$-step delayed observation models for
decentralized control have been previously proposed by Aicardi, Davoli, and Minciardi (1987) and
Ooi and Wornell (1996). In particular Aicardi et al. consider the Dec-MDP setting in which agent
$i$'s observations are local states $s_i$ and where a joint observation identifies the state
$s = \langle s_1,\dots,s_n \rangle$. Ooi and Wornell examine the decentralized control of a
broadcast channel over an infinite horizon, where they allow the local observations to be
arbitrary, but still require the joint state to be observed with a $k$-steps delay. Our
assumption is less strong, as we only require observation of $\vec\theta^{t-k}$ and because we
assume the general Dec-POMDP (not Dec-MDP) setting.

Such a $k$-step delayed communication model for the Dec-POMDP setting allows expressing the
different Q-value functions defined in this article as optimal value functions of appropriate
$k$-step delay models. More importantly, by resorting to such a $k$-step delay model we can prove
a hierarchy of bounds that hold over the various Q-functions defined in this article:

Theorem 5.1 (Hierarchy of upper bounds). The approximate Q-value functions $Q_{\mathrm{BG}}$ and
$Q_{\mathrm{POMDP}}$ correspond to the optimal Q-value functions of appropriately defined $k$-step
delayed communication models. Moreover these Q-value functions form a hierarchy of upper bounds
to the optimal $Q^*$ of the Dec-POMDP:

\[ Q^* \le Q_{\mathrm{BG}} \le Q_{\mathrm{POMDP}} \le Q_{\mathrm{MDP}}. \tag{5.14} \]

Proof. See appendix. □

The idea is that a POMDP corresponds to a system with no (0-steps) delayed communication, while
the $Q_{\mathrm{BG}}$-setting corresponds to a 1-step delayed communication system. The appendix
shows that the Q-value function of a system with $k$ steps delay forms an upper bound to that of
a decentralized system with $k+1$ steps delay. We note that the last inequality of (5.14) is a
well-known result (Hauskrecht, 2000).

6. Generalized Value-Based Policy Search 

The hierarchy of approximate Q-value functions implies that all of these Q-value functions
can be used as admissible heuristics in MAA* policy search, treated in Section 3.3. In
this section we will present a more general heuristic policy search framework which we will
call Generalized MAA* (GMAA*), and show how it unifies some of the solution methods
proposed for Dec-POMDPs.

Algorithm 1 GMAA*
 1: $v^\star \leftarrow -\infty$
 2: $P \leftarrow \{\varphi^0 = ()\}$
 3: repeat
 4:   $\varphi^t \leftarrow$ Select($P$)
 5:   $\Phi_{\mathrm{Next}} \leftarrow$ Next($\varphi^t$)
 6:   if $\Phi_{\mathrm{Next}}$ contains a subset of full policies $\Pi_{\mathrm{Next}} \subseteq \Phi_{\mathrm{Next}}$ then
 7:     $\pi' \leftarrow \arg\max_{\pi \in \Pi_{\mathrm{Next}}} V(\pi)$
 8:     if $V(\pi') > v^\star$ then
 9:       $v^\star \leftarrow V(\pi')$
10:       $\pi^\star \leftarrow \pi'$
11:       $P \leftarrow \{\varphi \in P \mid \hat{V}(\varphi) > v^\star\}$   {prune the policy pool}
12:     end if
13:     $\Phi_{\mathrm{Next}} \leftarrow \Phi_{\mathrm{Next}} \setminus \Pi_{\mathrm{Next}}$   {remove full policies}
14:   end if
15:   $P \leftarrow (P \setminus \varphi^t) \cup \{\varphi \in \Phi_{\mathrm{Next}} \mid \hat{V}(\varphi) > v^\star\}$   {remove processed/add new partial policies}
16: until $P$ is empty

GMAA* generalizes MAA* (Szer et al., 2005) by making explicit different procedures 
that are implicit in MAA* : (1) iterating over a pool of partial joint policies, pruning this pool 
whenever possible, (2) selecting a partial joint policy from the policy pool, and (3) finding 
some new partial and/or full joint policies given the selected policy. The first procedure is 
the core of GMAA* and is fixed, while the other two procedures can be performed in many 
ways. 

The second procedure, Select, chooses which policy to process next and thus determines
the type of search (e.g., depth-first, breadth-first, A*-like) (Russell & Norvig, 2003;
Bertsekas, 2005). The third procedure, which we will refer to as Next, determines how
the set of next (partial) joint policies is constructed, given a previous partial joint policy.
The original MAA* can be seen as an instance of the generalized case with a particular
Next-operator, namely that shown in Algorithm 2.

6.1 The GMAA* Algorithm 

In GMAA* we refer to a 'policy pool' $P$ rather than an open list, as it is a more neutral
word which does not imply any ordering. This policy pool $P$ is initialized with a completely
unspecified joint policy $\varphi^0 = ()$ and the maximum lower bound (found so far) $v^\star$ is
set to $-\infty$. $\pi^\star$ denotes the best joint policy found so far.

At this point GMAA* starts. First, the selection operator, Select, selects a partial joint
policy $\varphi$ from $P$. We will assume that, in accordance with MAA*, the partial policy with
the highest heuristic value is selected. In general, however, any kind of selection algorithm
may be used. Next, the selected policy is processed by the policy search operator Next,
which returns a set of (partial) joint policies $\Phi_{\mathrm{Next}}$ and their heuristic
values. When Next returns one or more full policies $\pi \in \Phi_{\mathrm{Next}}$, the provided
values $\hat{V}(\pi) = V(\pi)$ are a lower bound for an optimal joint policy, which can be used to
prune the search space. Any found partial joint policies $\varphi \in \Phi_{\mathrm{Next}}$ with a
heuristic value $\hat{V}(\varphi) > v^\star$ are added to $P$. The process is repeated until the
policy pool is empty.

Algorithm 2 Next($\varphi^t$): MAA*
 1: $\Phi^{t+1} \leftarrow \{\varphi^{t+1} = \langle \varphi^t, \beta^t \rangle\}$
 2: $\forall \varphi^{t+1} \in \Phi^{t+1}$: $\hat{V}(\varphi^{t+1}) \leftarrow V^{0 \dots t}(\varphi^{t+1}) + \hat{V}^{(t+1) \dots h}(\varphi^{t+1})$
 3: return $\Phi^{t+1}$
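The control flow of GMAA* can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: policies are plain tuples with one choice per stage, and the toy reward table, the `heuristic` function and the operator names are hypothetical stand-ins for a real Dec-POMDP, its Q-value heuristic, and the Select and Next operators.

```python
H = 3  # horizon of the hypothetical toy problem
REWARDS = [[1.0, 2.0], [0.0, 3.0], [5.0, 1.0]]  # reward per stage and choice


def true_value(policy):
    """Exact value of the stages specified so far."""
    return sum(REWARDS[t][c] for t, c in enumerate(policy))


def heuristic(policy):
    """Admissible estimate: exact value so far plus an optimistic remainder."""
    rest = sum(max(REWARDS[t]) + 1.0 for t in range(len(policy), H))
    return true_value(policy) + rest


def next_all_children(policy):
    """MAA*-style Next: return every one-stage extension with its value."""
    return {policy + (c,): heuristic(policy + (c,)) for c in range(2)}


def select_best(pool):
    """A*-like Select: the pooled policy with the highest heuristic value."""
    return max(pool, key=pool.get)


def gmaa_star(select, next_op, is_full):
    v_best, pi_best = float("-inf"), None
    pool = {(): float("inf")}  # completely unspecified policy
    while pool:
        phi = select(pool)
        children = next_op(phi)
        full = {p: v for p, v in children.items() if is_full(p)}
        if full:
            pi_prime = max(full, key=full.get)
            if full[pi_prime] > v_best:
                v_best, pi_best = full[pi_prime], pi_prime
                # prune the policy pool with the new lower bound
                pool = {p: v for p, v in pool.items() if v > v_best}
        pool.pop(phi, None)  # remove the processed policy (may be pruned)
        for p, v in children.items():
            if p not in full and v > v_best:
                pool[p] = v  # add promising partial policies
    return pi_best, v_best
```

Calling `gmaa_star(select_best, next_all_children, lambda p: len(p) == H)` runs the search on the toy problem. Swapping in a different `select` or `next_op` changes the search strategy without touching the main loop, which is the point of separating the three procedures.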

6.2 The Next Operator 

Here we describe some different choices for the Next-operator and how they correspond to 
existing Dec-POMDP solution methods. 

6.2.1 MAA* 

GMAA* reduces to standard MAA* by using the Next-operator described by Algorithm 2.
Line 1 expands $\varphi^t$, forming $\Phi^{t+1}$, the set of partial joint policies for one extra
stage. Line 2 valuates all these child policies, where $V^{0 \dots t}(\varphi^{t+1})$ gives the
true expected reward over the first $t+1$ stages and $\hat{V}^{(t+1) \dots h}(\varphi^{t+1})$ is
the heuristic value over stages $(t+1) \dots h$ given that $\varphi^{t+1}$ has been followed the
first $t+1$ stages.

When using an admissible heuristic, GMAA* will never prune a partial policy that can
be expanded into an optimal policy. When combining this with the fact that the MAA*-Next
operator returns all possible $\varphi^{t+1}$ for a $\varphi^t$, it is clear that when $P$
becomes empty an optimal policy has been found.

6.2.2 Forward-Sweep Policy Computation 

Forward-sweep policy computation, as introduced in Section 4.3.1, is described by
Algorithms 1 and 3 jointly. Given a partial joint policy $\varphi^t$, the Next operator now
constructs and solves a BG for time step $t$. Because Next in Algorithm 3 only returns the
best-ranked policy, $P$ will never contain more than 1 joint policy and the whole search process
reduces to solving BGs for time steps $0,\dots,h-1$.

The approach of Emery-Montemerlo et al. (2004) is identical to forward-sweep policy 
computation, except that 1) smaller BGs are created by discarding or clustering low proba- 
bility action-observation histories, and 2) the BGs are approximately solved by alternating 
maximization. Therefore this approach can also be incorporated in the GMAA* policy 
search framework by making the appropriate modifications in Algorithm 3. 
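Algorithm 3 Next($\varphi^t$): Forward-sweep policy computation
 1: construct the BG for time step $t$ given $\varphi^t$, with payoff function $\hat{Q}^t$
 2: for all $\beta = \langle \beta_1,\dots,\beta_n \rangle$ s.t. $\beta_i : \vec{\mathcal{O}}_i^t \to \mathcal{A}_i$ do
 3:   $\hat{V}^t(\beta) \leftarrow \sum_{\vec\theta^t \in \vec\Theta^t_{\varphi^t}} P(\vec\theta^t)\, \hat{Q}^t(\vec\theta^t,\beta(\vec\theta^t))$
 4:   $\varphi^{t+1} \leftarrow \langle \varphi^t, \beta \rangle$
 5:   $\hat{V}(\varphi^{t+1}) \leftarrow V^{0 \dots t-1}(\varphi^t) + \hat{V}^t(\beta)$
 6: end for
 7: return $\arg\max_{\varphi^{t+1}} \hat{V}(\varphi^{t+1})$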



6.2.3 Unification 

Here we will give a unified perspective of the MAA* and forward-sweep policy computation
by examining the relation between the corresponding Next-operators. In particular we
show that, when using any of the approximate Q-value functions described in Section 5 as
a heuristic, the sole difference between the two is that FSPC returns only the joint policy
with the highest heuristic value.



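Proposition 6.1. If a heuristic $\hat{Q}$ has the following form

\[ \hat{Q}^t(\vec\theta^t,a) = R(\vec\theta^t,a) + \sum_{o^{t+1}} P(o^{t+1}|\vec\theta^t,a)\, \hat{V}^{t+1}(\vec\theta^{t+1}), \tag{6.1} \]

then for a partial policy $\varphi^{t+1} = \langle \varphi^t, \beta^t \rangle$

\[ \sum_{\vec\theta^t \in \vec\Theta^t} P(\vec\theta^t)\, \hat{Q}^t(\vec\theta^t,\beta^t(\vec\theta^t)) = E\left[ R(s^t,a) \mid \varphi^{t+1} \right] + \hat{V}^{(t+1) \dots h}(\varphi^{t+1}) \tag{6.2} \]

holds.

Proof. The expectation of $R^t$ given $\varphi^{t+1}$ can be written as

\[ E\left[ R(s^t,a) \mid \varphi^{t+1} \right] = \sum_{\vec\theta^t \in \vec\Theta^t} P(\vec\theta^t) \sum_{s \in \mathcal{S}} R(s,\varphi^{t+1}(\vec\theta^t))\, P(s|\vec\theta^t) = \sum_{\vec\theta^t \in \vec\Theta^t} P(\vec\theta^t)\, R(\vec\theta^t,\varphi^{t+1}(\vec\theta^t)). \]

Also, we can rewrite $\hat{V}^{(t+1) \dots h}(\varphi^{t+1})$ as

\[ \hat{V}^{(t+1) \dots h}(\varphi^{t+1}) = \sum_{\vec\theta^t \in \vec\Theta^t} P(\vec\theta^t) \sum_{o^{t+1}} P(o^{t+1}|\vec\theta^t,\varphi^{t+1}(\vec\theta^t))\, \hat{V}^{(t+1) \dots h}(\vec\theta^{t+1}), \]

such that

\[ E\left[ R(s^t,a) \mid \varphi^{t+1} \right] + \hat{V}^{(t+1) \dots h}(\varphi^{t+1}) = \sum_{\vec\theta^t \in \vec\Theta^t} P(\vec\theta^t) \left[ R(\vec\theta^t,\varphi^{t+1}(\vec\theta^t)) + \sum_{o^{t+1}} P(o^{t+1}|\vec\theta^t,\varphi^{t+1}(\vec\theta^t))\, \hat{V}^{(t+1) \dots h}(\vec\theta^{t+1}) \right]. \tag{6.3} \]

Therefore, assuming (6.1) yields (6.2). □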
Agent 1                                  Agent 2
()                     → go house 3      ()                     → go house 2
flames                 → go house 3      flames                 → go house 2
no flames              → go house 1      no flames              → go house 2
flames, flames         → go house 1      flames, flames         → go house 1
flames, no flames      → go house 1      flames, no flames      → go house 1
no flames, flames      → go house 2      no flames, flames      → go house 1
no flames, no flames   → go house 2      no flames, no flames   → go house 1

Figure 9: Optimal policy for FireFighting $\langle n_h = 3, n_f = 3 \rangle$, horizon 3. On the
left the policy for the first agent, on the right the second agent's policy.



This means that if a heuristic satisfies (6.1), which is the case for all the Q-value functions
we discussed in this paper, the Next operators of Algorithms 2 and 3 evaluate the expanded
policies the same. I.e., Algorithms 2 and 3 calculate identical heuristic values for the same
next time step joint policies. Also the expanded policies $\varphi^{t+1}$ are formed in the same
way: by considering all possible BG-policies $\beta^t$, respectively $\beta$, to extend
$\varphi^t$. Therefore, the sole difference in this case is that the latter returns only the
joint policy with the highest heuristic value.

Clearly there is a computation time/quality trade-off between MAA* and FSPC: MAA*
is guaranteed to find an optimal policy (given an admissible heuristic), while FSPC is
guaranteed to finish in one forward sweep. We propose a generalization that returns the
$k$-best ranked policies. We refer to this as the '$k$-best joint BG policies' GMAA* variant, or
$k$-GMAA*. In this way, $k$-GMAA* reduces to forward-sweep policy computation for $k = 1$
and to MAA* for $k = \infty$.
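The truncation step that distinguishes these variants is small enough to sketch directly. The helper below is a hypothetical illustration (the name `k_best` and the dictionary representation of candidate policies are our own): it keeps the `k` highest-ranked children returned by a Next operator, so that `k=1` mimics forward-sweep policy computation and `k=None` keeps every child as MAA* does.

```python
import heapq


def k_best(children, k=None):
    """Keep the k highest-valued entries of a {policy: heuristic value} dict.

    With k=None all children are kept (the MAA* end of the spectrum).
    """
    if k is None:
        return dict(children)
    top = heapq.nlargest(k, children.items(), key=lambda item: item[1])
    return dict(top)
```

Wrapping an existing Next operator as `lambda phi: k_best(next_op(phi), k)` would give the intermediate variants, at the price of losing the optimality guarantee whenever a discarded child was the only one leading to an optimal policy.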

7. Experiments 

In order to compare the different approximate Q-value functions discussed in this work,
as well as to show the flexibility of the GMAA* algorithm, we have performed several
experiments. We use $Q_{\mathrm{MDP}}$, $Q_{\mathrm{POMDP}}$ and $Q_{\mathrm{BG}}$ as heuristic
estimates of $Q^*$. We will provide some qualitative insight into the different Q-value functions
we considered, as well as results on computing optimal policies using MAA*, and on the
performance of forward-sweep policy computation. First we will describe our problem domains, some
of which are standard test problems, while others are introduced in this work.

7.1 Problem Domains 

In Section 2.2 we discussed the decentralized tiger (Dec-Tiger) problem as introduced by 
Nair et al. (2003b). Apart from the standard Dec-Tiger domain, we consider a modified 
version, called Skewed Dec-Tiger, in which the start distribution is not uniform. Instead, 
initially the tiger is located on the left with probability 0.8. We also include results from the 
BroadcastChannel problem, introduced by Hansen et al. (2004), which models two nodes 
that have to cooperate to maximize the throughput of a shared communication channel. 
Furthermore, a test problem called "Meeting on a Grid" is provided by Bernstein et al. 
(2005), in which two robots navigate on a two-by-two grid. We consider the version with 2 
observations per agent (Amato et al., 2006). 
We introduce a new benchmark problem, which models a team of $n$ fire fighters that
have to extinguish fires in a row of $n_h$ houses. Each house is characterized by an integer
parameter $f$, or fire level. It indicates to what degree a house is burning, and it can have
$n_f$ different values, $0 \le f < n_f$. Its minimum value is 0, indicating the house is not
burning. At every time step, the agents receive a reward of $-f$ for each house and each agent
can choose to move to any of the houses to fight fires at that location. If a house is burning
($f > 0$) and no fire fighting agent is present, its fire level will increase by one point with
probability 0.8 if any of its neighboring houses are burning, and with probability 0.4 if none of
its neighbors are on fire. A house that is not burning can only catch fire with probability 0.8
if one of its neighbors is on fire. When two agents are in the same house, they will extinguish
any present fire completely, setting the house's fire level to 0. A single agent present at a
house will lower the fire level by one point with probability 1 if no neighbors are burning, and
with probability 0.6 otherwise. Each agent can only observe whether there are flames or
not at its location. Flames are observed with probability 0.2 if $f = 0$, with probability 0.5
if $f = 1$, and with probability 0.8 otherwise. Initially, the agents start outside any of the
houses, and the fire level $f$ of each house is drawn from a uniform distribution.
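The per-house dynamics described above can be written down as a small sketch. Two details are not spelled out in the text and are assumptions here: the fire level is capped at $n_f - 1$, and whenever the stated event does not happen the fire level stays unchanged. The function names are our own.

```python
def _merge(pairs):
    """Sum probabilities of identical outcomes and drop impossible ones."""
    dist = {}
    for level, prob in pairs:
        if prob > 0.0:
            dist[level] = dist.get(level, 0.0) + prob
    return dist


def next_fire_level_dist(f, agents_here, neighbor_burning, n_f=3):
    """Distribution over the next fire level of one house."""
    max_f = n_f - 1  # assumption: fire level is capped at n_f - 1
    if agents_here >= 2:
        return {0: 1.0}  # two agents extinguish the fire completely
    if agents_here == 1:
        p_down = 0.6 if neighbor_burning else 1.0
        return _merge([(max(f - 1, 0), p_down), (f, 1.0 - p_down)])
    if f > 0:  # burning, nobody present
        p_up = 0.8 if neighbor_burning else 0.4
        return _merge([(min(f + 1, max_f), p_up), (f, 1.0 - p_up)])
    # not burning, nobody present: can only ignite via a burning neighbor
    p_ignite = 0.8 if neighbor_burning else 0.0
    return _merge([(1, p_ignite), (0, 1.0 - p_ignite)])


def p_flames(f):
    """Probability that an agent at a house with fire level f sees flames."""
    if f == 0:
        return 0.2
    if f == 1:
        return 0.5
    return 0.8
```

A full transition model would apply `next_fire_level_dist` to every house, using the agents' chosen locations and the neighbors' current fire levels, and multiply the per-house distributions.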

We will test different variations of this problem, where the number of agents is always
2, but which differ in the number of houses and fire levels. In particular, we will consider
$\langle n_h = 3, n_f = 3 \rangle$ and $\langle n_h = 4, n_f = 3 \rangle$. Figure 9 shows an
optimal joint policy for horizon 3 of the former variation. One agent initially moves to the
middle house to fight fires there, which helps prevent fire from spreading to its two neighbors.
The other agent moves to house 3, and stays there if it observes fire, and moves to house 1 if it
does not observe flames. As well as being optimal, such a joint policy makes sense intuitively
speaking.

7.2 Comparing Q-value Functions

Before providing a comparison of performance of some of the approximate Q-value functions
described in this work, we will first give some more insight into their actual values. For the
$h = 4$ Dec-Tiger problem, we generated all possible $\vec\theta^t$ and the corresponding
$P(s_l|\vec\theta^t)$, according to (4.8). For each of these, the maximal
$Q(\vec\theta^t,a)$-value is plotted in Figure 10. Apart from the three approximate Q-value
functions, we also plotted the optimal value for each joint action-observation history
$\vec\theta^t$ that can be realized when using $\pi^*$. Note that different $\vec\theta^t$ can
have different optimal values, but induce the same $P(s_l|\vec\theta^t)$, as demonstrated
in the figure: there are multiple $Q^*$-values plotted for some $P(s_l|\vec\theta^t)$. For the
horizon 3 Meeting on a Grid problem we also collected all $\vec\theta^t$ that can be visited by
the optimal policy, and in Figure 11 we again plotted maximal $Q(\vec\theta^t,a)$-values. Because
this problem has many states, a representation as in Figure 10 is not possible. Instead, we
ordered the $\vec\theta^t$ according to their optimal value. We can see that the bounds are tight
for some $\vec\theta^t$, while for others they can be quite loose. However, when used in the
GMAA* framework, their actual performance as a heuristic also depends on their valuation of
$\vec\theta^t$ not shown by Figure 11, namely those that will not be visited by an optimal
policy: especially when these are overestimated, GMAA* will first examine a sub-optimal branch of
the search tree. A tighter upper bound can speed up computation to a very large extent, as it
allows the algorithm to prune the policy pool more, reducing the number of Bayesian games that
need to be solved. Both figures clearly illustrate the main property of the upper bounds we
discussed, namely that $Q^* \le Q_{\mathrm{BG}} \le Q_{\mathrm{POMDP}} \le Q_{\mathrm{MDP}}$ (see
Theorem 5.1).

Figure 10: Q-values for horizon 4 Dec-Tiger. For each $\vec\theta^t$, corresponding to some
$P(s_l|\vec\theta^t)$, the maximal $Q(\vec\theta^t,a)$-value is plotted.

7.3 Computing Optimal Policies 

As shown above, the hierarchy of upper bounds
$Q^* \le Q_{\mathrm{BG}} \le Q_{\mathrm{POMDP}} \le Q_{\mathrm{MDP}}$ is not just a theoretical
construct, but the differences in value specified can be significant for particular problems. In
order to evaluate what the impact is of the differences between the approximate Q-value
functions, we performed several experiments. Here we describe our evaluation of MAA* on a number
of test problems using $Q_{\mathrm{BG}}$, $Q_{\mathrm{POMDP}}$ and $Q_{\mathrm{MDP}}$ as
heuristic. All timing results in this paper are CPU times with a resolution of 0.01 s, and
were obtained on 3.4 GHz Intel Xeon processors.

Figure 11: Comparison of maximal $Q(\vec\theta^t,a)$-values for Meeting on a Grid. We plot the
value of all $\vec\theta^t$ that can be reached by an optimal policy, ordered according to their
optimal value.
Table 1: MAA* results for Dec-Tiger.

h   $V^*$     heuristic              $n_\varphi$      $T_{\mathrm{GMAA}^*}$   $T_Q$
3   5.8402    $Q_{\mathrm{MDP}}$     151,236          0.46 s                  s
              $Q_{\mathrm{POMDP}}$   19,854           0.06 s                  0.01 s
              $Q_{\mathrm{BG}}$      13,212           0.04 s                  0.03 s
4   11.1908   $Q_{\mathrm{MDP}}$     33,921,256,149   388,894 s               s
              $Q_{\mathrm{POMDP}}$   774,880,515      8,908 s                 0.13 s
              $Q_{\mathrm{BG}}$      86,106,735       919 s                   0.92 s

Table 2: MAA* results for Skewed Dec-Tiger.

Table 3: MAA* results for BroadcastChannel.

Table 4: MAA* results for Meeting on a Grid.



Table 1 shows the results MAA* obtained on the original Dec-Tiger problem for horizon
3 and 4. It shows for each heuristic the number of partial joint policies evaluated
$n_\varphi$, CPU time spent on the GMAA* phase $T_{\mathrm{GMAA}^*}$, and CPU time spent on
calculating the heuristic $T_Q$. As $Q_{\mathrm{BG}}$, $Q_{\mathrm{POMDP}}$ and
$Q_{\mathrm{MDP}}$ are upper bounds to $Q^*$, MAA* is guaranteed to find the optimal policy when
using them as heuristic; however, the timing results may differ.

For $h = 3$ we see that using $Q_{\mathrm{POMDP}}$ and $Q_{\mathrm{BG}}$ only a fraction of the
number of policies are evaluated when compared to $Q_{\mathrm{MDP}}$, which reflects
proportionally in the time spent on GMAA*. For this horizon $Q_{\mathrm{POMDP}}$ and
$Q_{\mathrm{BG}}$ perform the same, but the time needed to compute the $Q_{\mathrm{BG}}$
heuristic is as long as the GMAA*-phase, therefore $Q_{\mathrm{POMDP}}$ outperforms
$Q_{\mathrm{BG}}$ here. For $h = 4$, the impact of using tighter heuristics becomes even more
pronounced. In this case the computation time of the heuristic is negligible, and
$Q_{\mathrm{BG}}$ outperforms both, as it is able to prune many more partial joint policies from
the policy pool. Table 2 shows results for Skewed Dec-Tiger. For this problem the
$Q_{\mathrm{MDP}}$ and $Q_{\mathrm{BG}}$ results are roughly the same as for the original
Dec-Tiger problem; for $h = 3$ the timings are a bit slower, and for $h = 4$ they are faster. For
$Q_{\mathrm{POMDP}}$, however, we see that for $h = 4$ the results are slower as well and that
$Q_{\mathrm{BG}}$ outperforms $Q_{\mathrm{POMDP}}$ by an order of magnitude.

Results for the Broadcast Channel (Table 3), Meeting on a Grid (Table 4) and a Fire
fighting problem (Table 5) are similar. The N/A entry in Table 3 indicates that
$Q_{\mathrm{MDP}}$ was not able to compute a solution within 5 days. For these problems we also
see that the performance of $Q_{\mathrm{POMDP}}$ and $Q_{\mathrm{BG}}$ is roughly equal. For the
Meeting on a Grid problem, $Q_{\mathrm{BG}}$ yields a significant speedup over
$Q_{\mathrm{POMDP}}$.

Table 5: MAA* results for Fire Fighting $\langle n_h = 3, n_f = 3 \rangle$.

Figure 12: Policies found using forward-sweep policy computation (i.e., $k = 1$) for the $h = 4$
Dec-Tiger problem. Left: the policy resulting from $Q_{\mathrm{MDP}}$. Right: the optimal
policy as calculated by $Q_{\mathrm{POMDP}}$ and $Q_{\mathrm{BG}}$. The framed entries highlight
the crucial differences.



7.4 Forward-Sweep Policy Computation 

The MAA* results described above indicate that the use of a tighter heuristic can yield
substantial time savings. In this section, the approximate Q-value functions are used in
forward-sweep policy computation. We would expect that when using a Q-value function
that more closely resembles $Q^*$, the quality of the resulting policy will be higher. We also
tested whether $k$-GMAA* with $k > 1$ improved the quality of the computed policies. In
particular, we tested $k = 1,2,\dots,5$.

For the Dec-Tiger problem, $k$-GMAA* with $k = 1$ (and thus also $2 \le k \le 5$) found
the optimal policy (with $V(\pi^*) = 5.19$) for horizon 3 using all approximate Q-value
functions. For horizon $h = 4$, also all different values of $k$ produced the same result for
each approximate Q-value function. In this case, however, $Q_{\mathrm{MDP}}$ found a policy with
expected return of 3.19. $Q_{\mathrm{POMDP}}$ and $Q_{\mathrm{BG}}$ found the optimal policy
($V(\pi^*) = 4.80$). Figure 12 illustrates the optimal policy (right) and the one found by
$Q_{\mathrm{MDP}}$ (left). It shows that $Q_{\mathrm{MDP}}$ overestimates the value for opening
the door in stage $t = 2$.

Figure 13: $k$-GMAA* results for different problems and horizons. The y-axis indicates value
of the initial joint belief, while the x-axis denotes $k$.



For the Skewed Dec-Tiger problem, different values of $k$ did produce different results. In
particular, for $h = 3$ only $Q_{\mathrm{BG}}$ finds the optimal policy (and thus attains the
optimal value) for all values of $k$, as shown in Figure 13(a). $Q_{\mathrm{POMDP}}$ does find it
starting from $k = 2$, and $Q_{\mathrm{MDP}}$ only from $k = 5$. Figure 13(b) shows a somewhat
unexpected result for $h = 4$: here for $k = 1$ $Q_{\mathrm{MDP}}$ and $Q_{\mathrm{BG}}$ find the
optimal policy, but $Q_{\mathrm{POMDP}}$ does not. This clearly illustrates that a tighter
approximate Q-value function is not a guarantee for a better joint policy, which is also
illustrated by the results for GridSmall in Figure 13(c).

We also performed the same experiment for two settings of the FireFighting problem.
For $\langle n_h = 3, n_f = 3 \rangle$ and $h = 3$ all Q-value functions found the optimal policy
(with value $-5.7370$) for all $k$, and horizon 4 is shown in Figure 13(d). Figures 13(e) and
13(f) show the results for $\langle n_h = 4, n_f = 3 \rangle$. For $h = 4$, $Q_{\mathrm{MDP}}$
did not finish for $k > 3$ within 5 days.

It is encouraging that for all experiments $k$-GMAA* using $Q_{\mathrm{BG}}$ and
$Q_{\mathrm{POMDP}}$ with $k \le 2$ found the optimal policy. Using $Q_{\mathrm{MDP}}$ the
optimal policy was also always found with $k \le 5$, except in horizon 4 Dec-Tiger and the
$\langle n_h = 4, n_f = 3 \rangle$ FireFighting problem. These results seem to indicate that this
type of approximation might be likely to produce (near-) optimal results for other domains as
well.

8. Conclusions

A large body of work in single-agent decision-theoretic planning is based on value functions,
but such theory has been lacking thus far for Dec-POMDPs. Given the large impact of value
functions on single-agent planning under uncertainty, we expect that a thorough study of
value functions for Dec-POMDPs can greatly benefit multiagent planning under uncertainty.
In this work, we presented a framework of Q-value functions for Dec-POMDPs, providing a
significant contribution to fill this gap in Dec-POMDP theory. Our theoretical contributions
have led to new insights, which we applied to improve and extend solution methods.

We have shown how an optimal joint policy $\pi^*$ induces an optimal Q-value function
$Q^*(\vec\theta^t,a)$, and how it is possible to construct the optimal policy $\pi^*$ using
forward-sweep policy computation. This entails solving Bayesian games for time steps
$t = 0,\dots,h-1$ which use $Q^*(\vec\theta^t,a)$ as the payoff function. Because there is no
clear way to compute $Q^*(\vec\theta^t,a)$, we introduced a different description of the optimal
Q-value function $Q^*(\vec\theta^t,\psi^{t+1})$ that is based on sequential rationality. This new
description of $Q^*$ can be computed using dynamic programming and can then be used to construct
$\pi^*$.

Because calculating $Q^*$ is computationally expensive, we examined approximate Q-value
functions that can be calculated more efficiently and we discussed how they relate to $Q^*$.
We covered $Q_{\mathrm{MDP}}$, $Q_{\mathrm{POMDP}}$, and $Q_{\mathrm{BG}}$, a recently proposed
approximate Q-value function. Also, we established that decreasing communication delays in
decentralized systems cannot decrease the expected value and thus that
$Q^* \le Q_{\mathrm{BG}} \le Q_{\mathrm{POMDP}} \le Q_{\mathrm{MDP}}$. Experimental
evaluation indicated that these upper bounds are not just of theoretical interest, but that
significant differences exist in the tightness of the various approximate Q-value functions.

Additionally we showed how the approximate Q-value functions can be used as heuristics
in a generalized policy search method GMAA*, thereby unifying forward-sweep policy
computation and the recent Dec-POMDP solution techniques of Emery-Montemerlo et al.
(2004) and Szer et al. (2005). Finally, we performed an empirical evaluation of GMAA*
showing significant reductions in computation time when using tighter heuristics to calculate
optimal policies. Also $Q_{\mathrm{BG}}$ generally found better approximate solutions in
forward-sweep policy computation and the '$k$-best joint BG policies' GMAA* variant, or
$k$-GMAA*.

There are quite a few directions for future research. One is to try to extend the results 
of this paper to partially observable stochastic games (POSGs) (Hansen et al., 2004), which 
are Dec-POMDPs with an individual reward function for each agent. Since the dynamics of 
the POSG model are identical to those of a Dec-POMDP, a similar modeling via Bayesian 
games is possible. An interesting question is whether also in this case, an optimal (i.e., 
rational) joint policy can be found by forward-sweep policy computation. 

Staying within the context of Dec-POMDPs, a research direction could be to further 
generalize GMAA*, by defining other Next or Select operators, with the hope that the 
resulting algorithms will be able to scale to larger problems. Also it is important to establish 
bounds on the performance and learning curves of GMAA* in combination with different 
Next operators and heuristics. A different direction is to experimentally evaluate the use 
of even tighter heuristics such as Q-value functions for the case of observations delayed by 
multiple time steps. This research should be paired with methods to efficiently find such 
Q-value functions. Finally, future research should further examine Bayesian games. In 
particular, the work of Emery-Montemerlo et al. (2005) could be used as a starting point 
for further research on approximately modeling Dec-POMDPs using BGs. Finally, there is 
a need for efficient approximate methods for solving the Bayesian games. 

Appendix A. Proofs 

A.1 There is At Least One Optimal Pure Joint Policy 

Proposition (2.1). A Dec-POMDP has at least one optimal pure joint policy. 

Proof. This proof follows a proof by Schoute (1978). It is possible to convert a Dec-POMDP 
to an extensive game and thus to a strategic game, in which the actions are pure policies for 
the Dec-POMDP (Oliehoek & Vlassis, 2006). In this strategic game, there is at least one 
maximizing entry corresponding to a pure joint policy which we denote $\pi_{\max}$. Now, assume 
that there is a joint stochastic policy $\varsigma = \langle \varsigma_1, \dots, \varsigma_n \rangle$ that attains a higher payoff. Kuhn 
(1953) showed that for each stochastic policy $\varsigma_i$, there is a corresponding mixed policy $\mu_i$. 
Therefore $\varsigma$ corresponds to a joint mixed policy $\mu = \langle \mu_1, \dots, \mu_n \rangle$. Let us write $\Pi_{i,\mu_i}$ for 
the support of $\mu_i$. $\mu$ now induces a probability distribution $p_\mu$ over the set of joint policies 
$\Pi_\mu = \Pi_{1,\mu_1} \times \dots \times \Pi_{n,\mu_n} \subseteq \Pi$, which is a subset of the set of all joint policies. The expected 
payoff can now be written as 

$$V(\varsigma) = E_{p_\mu}\big( V(\pi) \mid \pi \in \Pi_\mu \big) \le \max_{\pi \in \Pi_\mu} V(\pi) \le V(\pi_{\max}),$$

contradicting that $\varsigma$ is a joint stochastic policy that attains a higher payoff. □ 
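
The argument can be checked numerically on a toy strategic game. The sketch below is illustrative only: the payoff table `V` and the mixing weights `mu_1`, `mu_2` are made-up values (not taken from any benchmark in this paper), and the two-agent restriction is an assumption for brevity. It shows that the expected payoff of a joint mixed policy is a convex combination of pure-policy payoffs and therefore cannot exceed the best pure entry.

```python
import itertools

# Hypothetical payoff table: V[(i, j)] is the value of the pure joint policy in
# which agent 1 follows its pure policy i and agent 2 follows its pure policy j.
V = {(0, 0): 1.0, (0, 1): 4.0, (1, 0): 3.0, (1, 1): 2.0}

# Hypothetical mixed policies: one distribution over pure policies per agent.
mu_1 = [0.3, 0.7]
mu_2 = [0.6, 0.4]


def mixed_value(V, mu_1, mu_2):
    """Expected payoff when each agent samples its pure policy independently."""
    return sum(
        mu_1[i] * mu_2[j] * V[(i, j)]
        for i, j in itertools.product(range(len(mu_1)), range(len(mu_2)))
    )


best_pure = max(V.values())            # value of the maximizing pure entry
value_mixed = mixed_value(V, mu_1, mu_2)
print(value_mixed, best_pure)
```

Any other choice of weights gives the same qualitative outcome, since a weighted average of the table entries is bounded by its largest entry.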
A.2 Hierarchy of Q-value Functions 

This section lists the proof of theorem 5.1. It is ordered as follows. First, Section A.2.1 
presents a model and resulting value functions for Dec-POMDPs with k-steps delayed 
communication. Next, Section A.2.2 shows that $Q_{\mathrm{POMDP}}$, $Q_{\mathrm{BG}}$ and $Q^*$ correspond with the 
case that k is respectively 0, 1 and h. Finally, Section A.2.3 shows that when the 
communication delay k decreases, the optimal expected return cannot decrease, thereby proving 
theorem 5.1. 

A.2.1 Modeling Dec-POMDPs with k-Steps Delayed Communication 

Here we present an augmented MDP that can be used to find the optimal solution for 
Dec-POMDPs with k steps delayed communication. This is a reformulation of the work by 
Aicardi et al. (1987) and Ooi and Wornell (1996), extended to the Dec-POMDP setting. 

Figure 14: Policies specified by the states of the augmented MDP for k = 2. Top: policies 
for $\hat{s}^{t}$. The policy extended by augmented MDP action $\hat{a}^{t} = \beta^{t+k}$ is shown dashed. 
Bottom: The resulting policies after the transition for joint observation $\langle o_1, o_2 \rangle$. 



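We define this augmented MDP as $\hat{M} = \langle \hat{S}, \hat{A}, \hat{T}, \hat{R} \rangle$, where the augmented MDP stages are 
indicated $\hat{t}$. 

The state space is $\hat{S} = (\hat{S}^{\hat{t}=0}, \dots, \hat{S}^{\hat{t}=h-1})$. An augmented state is composed of a joint 
action-observation history, and a joint policy tree $q^t$. 

$$\hat{s}^{\hat{t}=t} = \begin{cases} \langle \vec{\theta}^{t}, q^{\tau=k,t} \rangle, & 0 \le t \le h-k-1 \\ \langle \vec{\theta}^{t}, q^{\tau=h-t,t} \rangle, & h-k \le t \le h-1. \end{cases}$$

The contained $q^{\tau=k,t}$ is a joint depth-k (specifying actions for k stages) joint policy tree 
$q^{\tau=k,t} = \langle q_1^{\tau=k,t}, \dots, q_n^{\tau=k,t} \rangle$, to be used starting at stage t. For the last k stages, the contained joint policy 
$q^{\tau=h-t,t}$ specifies $\tau = h - t \le k$ stages. 

$\hat{A}$ is the set of augmented actions. For $0 \le t \le h-k-1$, an action $\hat{a}^{\hat{t}=t} \in \hat{A}$ is a joint 
policy $\hat{a}^{\hat{t}=t} = \beta^{t+k} = \langle \beta_1^{t+k}, \dots, \beta_n^{t+k} \rangle$ implicitly mapping length-k observation histories to 
joint actions to be taken at stage t + k. I.e., $\beta_i^{t+k} : \vec{\mathcal{O}}_i^{k} \to \mathcal{A}_i$. For the last k stages 
$h-k \le t \le h-1$ there only is one empty action that has no influence whatsoever. 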

The augmented actions are used to expand the joint policy trees. When 'appending' a 
policy $\beta^{t+k}$ to $q^{t}$ we form a depth k + 1 policy, which we denote $q^{\tau=k+1,t} = \langle q^{t} \circ \beta^{t+k} \rangle$. After 
execution of its initial joint action $q^{\tau=k+1,t}(\vec{o}_{\emptyset})$ and receiving a particular joint observation $o$, 
the tree $q^{\tau=k+1,t}$ reduces to its depth k sub-tree policy for that particular joint observation, denoted 
$q^{\tau=k,t+1} = q^{\tau=k+1,t}(o) = \langle q^{\tau=k,t} \circ \beta^{t+k} \rangle(o)$. This is illustrated in Figure 14. 

Figure 15: An illustration of the augmented MDP with k = 2, showing a transition from 
$\hat{s}^{t}$ to $\hat{s}^{t+1}$ by action $\hat{a}^{t} = \beta^{t+k}$. In this example $\vec{\theta}^{t+1} = (\vec{\theta}^{t}, \langle a_1, a_2 \rangle, \langle o_1, o_2 \rangle)$. 
The actions specified for stage t are given by $q^{t}(\vec{o}_{\emptyset}) = \langle a_1, a_2 \rangle$ as depicted in Figure 14. 
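
The append-and-reduce operation on joint policy trees can be sketched in a few lines. This is a minimal illustration under assumptions that are not part of the paper: a joint policy tree is stored as a dictionary from observation-history tuples to joint actions, the names `append` and `reduce_tree` are hypothetical, and the observation and action labels are placeholders.

```python
def append(q, beta):
    """Form the depth-(k+1) tree <q o beta>: beta's entries become the new leaves."""
    extended = dict(q)
    extended.update(beta)
    return extended


def reduce_tree(q, o):
    """Return the sub-tree of q that remains after receiving joint observation o."""
    return {hist[1:]: a for hist, a in q.items() if hist and hist[0] == o}


# Depth-2 tree (k = 2) over two hypothetical joint observations 'x' and 'y'.
# The empty tuple () plays the role of the empty observation history.
q_t = {(): 'a0', ('x',): 'a1', ('y',): 'a2'}

# beta maps length-2 observation histories to joint actions for stage t + 2.
beta = {('x', 'x'): 'b0', ('x', 'y'): 'b1', ('y', 'x'): 'b2', ('y', 'y'): 'b3'}

# After the initial action and observation 'x', a depth-2 tree remains.
q_next = reduce_tree(append(q_t, beta), 'x')
print(q_next)
```

The reduced tree again has depth k, which is what keeps the augmented state at a fixed size for the stages before the last k.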

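$\hat{T}$ is the transition model. A probability $P(\hat{s}^{t+1} \mid \hat{s}^{t}, \hat{a}^{t})$ for stage $\hat{t} = t$ translates as follows 
for $0 \le t \le h-k-1$ 

$$P\big( \langle \vec{\theta}^{t+1}, q^{t+1} \rangle \,\big|\, \langle \vec{\theta}^{t}, q^{t} \rangle, \beta^{t+k} \big) = \begin{cases} P\big(o^{t+1} \mid \vec{\theta}^{t}, q^{t}(\vec{o}_{\emptyset})\big) & \text{if conditions hold,} \\ 0 & \text{otherwise,} \end{cases} \tag{A.1}$$

where the conditions are: 1) $q^{t+1} = \langle q^{t} \circ \beta^{t+k} \rangle(o^{t+1})$, and 2) $\vec{\theta}^{t+1} = (\vec{\theta}^{t}, a^{t}, o^{t+1})$. For 
$h-k \le t \le h-1$, $\beta^{t+k}$ in (A.1) reduces to $\emptyset$. The probabilities are unaffected, but the 
first condition changes to $q^{\tau=h-t-1,t+1} = q^{\tau=h-t,t}(o^{t+1})$. 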

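Finally, $\hat{R}$ is the reward model, which is specified as follows: 

$$\forall_{0 \le t \le h-1} \quad \hat{R}(\hat{s}^{\hat{t}=t}) = R\big(\langle \vec{\theta}^{t}, q^{t} \rangle\big) = R\big(\vec{\theta}^{t}, q^{t}(\vec{o}_{\emptyset})\big) \tag{A.2}$$

where $q^{t}(\vec{o}_{\emptyset})$ is the initial joint action specified by $q^{t}$. $R(\vec{\theta}^{t}, a)$ is defined as before in (2.12). 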

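The resulting optimality equations $\hat{Q}^{t}(\hat{s}, \hat{a})$ for the augmented MDP are as follows. We 
will write $Q_k$ for the optimal Q-value function for a k-steps delayed communication system. 
We will also refer to this as the k-$Q_{\mathrm{BG}}$ value function. 

$$\forall_{0 \le t \le h-k-1} \quad Q_k(\vec{\theta}^{t}, q^{t}, \beta^{t+k}) = R\big(\vec{\theta}^{t}, q^{t}(\vec{o}_{\emptyset})\big) + \sum_{o^{t+1}} P\big(o^{t+1} \mid \vec{\theta}^{t}, q^{t}(\vec{o}_{\emptyset})\big)\, Q_k^{*}(\vec{\theta}^{t+1}, q^{t+1}) \tag{A.3}$$

with $q^{t+1} = \langle q^{t} \circ \beta^{t+k} \rangle(o^{t+1})$ and where 

$$Q_k^{*}(\vec{\theta}^{t}, q^{t}) = \max_{\beta^{t+k}} Q_k(\vec{\theta}^{t}, q^{t}, \beta^{t+k}). \tag{A.4}$$

For the last k stages, $h-k \le t \le h-1$, there are $\tau' = h - t$ stages to go and we get 

$$Q_k^{*}(\vec{\theta}^{t}, q^{\tau=\tau',t}) = R\big(\vec{\theta}^{t}, q^{\tau=\tau',t}(\vec{o}_{\emptyset})\big) + \sum_{o^{t+1}} P\big(o^{t+1} \mid \vec{\theta}^{t}, q^{\tau=\tau',t}(\vec{o}_{\emptyset})\big)\, Q_k^{*}(\vec{\theta}^{t+1}, q^{\tau=\tau'-1,t+1}). \tag{A.5}$$

Note that (A.5) does not include any augmented actions $\hat{a}^{t} = \beta^{t+k}$. Therefore, the last 
k stages should be interpreted as a Markov chain. Standard dynamic programming can be 
applied to calculate all $Q_k^{*}(\vec{\theta}^{t}, q^{t})$-values. 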
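
The Markov-chain evaluation of the last k stages, in the spirit of (A.5), can be sketched as a short recursion. This is a toy illustration, not the implementation used in the paper: the function name `evaluate`, the dictionary representation of a policy tree, and the reward and observation models `toy_reward` and `toy_obs_prob` are all assumptions chosen to keep the example self-contained.

```python
def evaluate(theta, q, reward, obs_prob):
    """Expected return of following policy tree q from history theta.

    theta    : tuple of (joint action, joint observation) pairs so far
    q        : dict from observation-history tuples to joint actions
    reward   : callable (theta, a) -> immediate reward, standing in for R(theta, a)
    obs_prob : callable (theta, a) -> {o: probability}, standing in for P(o | theta, a)
    """
    if not q:                       # empty tree: no stages left to evaluate
        return 0.0
    a = q[()]                       # initial joint action of the tree
    value = reward(theta, a)
    for o, p in obs_prob(theta, a).items():
        # Sub-tree that remains after observing o (one stage shorter).
        sub = {h[1:]: act for h, act in q.items() if h and h[0] == o}
        value += p * evaluate(theta + ((a, o),), sub, reward, obs_prob)
    return value


# Hypothetical toy model: reward 1 for action 'good', uniform observations.
def toy_reward(theta, a):
    return 1.0 if a == 'good' else 0.0


def toy_obs_prob(theta, a):
    return {'x': 0.5, 'y': 0.5}


q = {(): 'good', ('x',): 'good', ('y',): 'bad'}   # depth-2 policy tree
value = evaluate((), q, toy_reward, toy_obs_prob)
print(value)
```

No maximization appears in the recursion because the tree already fixes every remaining action, which is why these stages behave as a Markov chain.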

A.2.2 Relation of k-$Q_{\mathrm{BG}}$ with Other Approximate Q-value Functions 

Here we briefly show how k-$Q_{\mathrm{BG}}$ in fact reduces to some of the cases treated earlier. 

For k = 0, k-$Q_{\mathrm{BG}}$ (A.3) reduces to $Q_{\mathrm{POMDP}}$. In the k = 0 case, $q^{\tau=k,t}$ becomes a depth-0, 
i.e. empty, policy. Also, $\beta^{t+k}$ becomes a mapping from length-0 observation histories to 
actions, i.e., it becomes a joint action. Substitution in (A.3) yields 

$$Q_0\big(\langle \vec{\theta}^{t}, \emptyset \rangle, a^{t}\big) = R(\vec{\theta}^{t}, a^{t}) + \sum_{o^{t+1}} P(o^{t+1} \mid \vec{\theta}^{t}, a^{t}) \max_{a^{t+1}} Q_0\big(\langle \vec{\theta}^{t+1}, \emptyset \rangle, a^{t+1}\big).$$

Now, as $Q_{\mathrm{P}}(\vec{\theta}^{t}, a) = Q_{\mathrm{P}}(b^{\vec{\theta}^{t}}, a)$, this clearly corresponds to the $Q_{\mathrm{POMDP}}$-value function 
(5.6). 

1-$Q_{\mathrm{BG}}$ reduces to regular $Q_{\mathrm{BG}}$. Notice that for k = 1, $q^{\tau=1,t}$ reduces to $a^{t}$. Filling out 
yields: 

$$Q_1\big(\langle \vec{\theta}^{t}, a^{t} \rangle, \beta^{t+1}\big) = R(\vec{\theta}^{t}, a^{t}) + \sum_{o^{t+1}} P(o^{t+1} \mid \vec{\theta}^{t}, a^{t}) \max_{\beta^{t+2}} Q_1\big(\langle \vec{\theta}^{t+1}, \beta^{t+1}(o^{t+1}) \rangle, \beta^{t+2}\big).$$

Now using (A.4) we obtain the $Q_{\mathrm{BG}}$-value function (5.12). 

A Dec-POMDP is identical to an h-steps delayed communication system. Augmented 
states have the form $\hat{s}^{\hat{t}=0} = \langle \vec{\theta}^{0}, q^{0} \rangle$, where $q^{0} = \pi$ specifies a full length h joint policy. The 
first stage t = 0 in this augmented MDP is also one of the last k (= h) stages. Therefore, 
the applied Q-function is (A.5), which means that the Markov chain evaluation starts 
immediately. Effectively this boils down to evaluation of all joint policies (corresponding to 
all augmented start states). The maximizing one specifies the value function of an optimal 
joint policy $Q^*$. 

A.2.3 Shorter Communication Delays cannot Decrease the Value 

First, we introduce some notation. Let us write $P_o(o^{t+l})$ for all the observation probabilities given 
$\vec{\theta}^{t}, q^{t}$ and the sequence of 'intermediate observations' $(o^{t+1}, \dots, o^{t+l-1})$ 

$$P_o(o^{t+l}) \equiv P\big( o^{t+l} \,\big|\, (\vec{\theta}^{t}, q^{t}(\vec{o}_{\emptyset}), o^{t+1}, q^{t}(o^{t+1}), \dots, o^{t+l-1}),\, q^{t}(\vec{o}^{\,t+l-1}) \big) \qquad \forall_{l \le k}. \tag{A.6}$$

In order to avoid confusion, we write $\beta_{|k|}^{t+k}$ for a policy that implicitly maps length-k 
observation histories to actions, and $\beta_{|k+1|}^{t+k+1}$ for one that is a mapping from length (k + 1) 
observation histories to actions. 

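Now we give a reformulation of $Q_k$. $Q_k^{*}(\vec{\theta}^{t}, q^{t})$ specifies the expected return for $\vec{\theta}^{t}, q^{t}$ 
over stages $t, t+1, \dots, h-1$. Here, we will split this 

$$Q_k^{*}(\vec{\theta}^{t}, q^{t}) = K_k(\vec{\theta}^{t}, q^{t}) + F_k^{t,*}(\vec{\theta}^{t}, q^{t}) \tag{A.7}$$

in $K_k(\vec{\theta}^{t}, q^{t})$, the expected k-step reward, i.e., the expected return for stages $t, \dots, t+k-1$, 
and $F_k^{t,*}(\vec{\theta}^{t}, q^{t})$, the expected return over stages $t+k, t+k+1, \dots, h-1$, referred to as the 
'in k-steps' expected return. 

The former is defined as 

$$K_k(\vec{\theta}^{t}, q^{t}) = E\Big[ \sum_{t'=t}^{t+k-1} R(\vec{\theta}^{t'}, a^{t'}) \,\Big|\, \vec{\theta}^{t}, q^{t} \Big]. \tag{A.8}$$

Let us define $K^{\tau=i}(\vec{\theta}^{t}, q^{\tau=i,t})$ as the expected reward for the next i stages, i.e., 

$$K_k(\vec{\theta}^{t}, q^{t}) = K^{\tau=k}(\vec{\theta}^{t}, q^{t}). \tag{A.9}$$

We then have $K^{\tau=1}(\vec{\theta}^{t}, a^{t}) = R(\vec{\theta}^{t}, a^{t})$ and 

$$K^{\tau=i}(\vec{\theta}^{t}, q^{\tau=i,t}) = R\big(\vec{\theta}^{t}, q^{\tau=i,t}(\vec{o}_{\emptyset})\big) + \sum_{o^{t+1}} P\big(o^{t+1} \mid \vec{\theta}^{t}, q^{\tau=i,t}(\vec{o}_{\emptyset})\big)\, K^{\tau=i-1}\big(\vec{\theta}^{t+1}, q^{\tau=i-1,t+1}(o^{t+1})\big), \tag{A.10}$$

where $q^{\tau=i-1,t+1}(o^{t+1})$ is the depth-(i − 1) joint policy that results from $q^{\tau=i,t}$ after 
observation of $o^{t+1}$. 

We define $F_k^{\tau=i,t}(\vec{\theta}^{t}, q^{t}, \beta_{|k|}^{t+k})$ to be the expected reward for stages $t+i, t+i+1, \dots, h-1$. 
That is, the time-to-go $\tau = i$ denotes how much time-to-go before we start accumulating 
expected reward. The 'in k-steps' expected return is then given by 

$$F_k^{t}(\vec{\theta}^{t}, q^{t}, \beta_{|k|}^{t+k}) = F_k^{\tau=k,t}(\vec{\theta}^{t}, q^{t}, \beta_{|k|}^{t+k}).$$

The evaluation is then performed by 

$$F_k^{\tau=0,t}(\vec{\theta}^{t}, q^{t}, \beta_{|k|}^{t+k}) = Q_k(\vec{\theta}^{t}, q^{t}, \beta_{|k|}^{t+k}) \tag{A.11}$$

$$F_k^{\tau=i,t}(\vec{\theta}^{t}, q^{t}, \beta_{|k|}^{t+k}) = \sum_{o^{t+1}} P_o(o^{t+1})\, F_k^{\tau=i-1,t+1,*}(\vec{\theta}^{t+1}, q^{t+1}) \tag{A.12}$$

where $q^{t+1} = \langle q^{t} \circ \beta^{t+k} \rangle(o^{t+1})$, and where 

$$F_k^{\tau=i,t,*}(\vec{\theta}^{t}, q^{t}) = \max_{\beta_{|k|}^{t+k}} F_k^{\tau=i,t}(\vec{\theta}^{t}, q^{t}, \beta_{|k|}^{t+k}). \tag{A.13}$$

Theorem A.1 (Shorter communication delays cannot decrease the value). The optimal 
Q-value function $Q_k$ of a finite horizon Dec-POMDP with k-steps delayed communication 
is an upper bound to $Q_{k+1}$, that of a (k + 1)-steps delayed communication system. That is 

$$\forall_{t} \forall_{\vec{\theta}^{t}} \forall_{q^{\tau=k,t}} \forall_{\beta_{|k|}^{t+k}} \quad Q_k(\vec{\theta}^{t}, q^{\tau=k,t}, \beta_{|k|}^{t+k}) \ge \max_{\beta_{|k+1|}^{t+k+1}} Q_{k+1}\big(\vec{\theta}^{t}, \langle q^{\tau=k,t} \circ \beta_{|k|}^{t+k} \rangle, \beta_{|k+1|}^{t+k+1}\big). \tag{A.14}$$

Proof. The proof is by induction. The base case is that (A.14) holds for stages 
$h-(k+1) \le t \le h-1$, as shown by lemma A.1. The induction hypothesis states that, assuming 
(A.14) holds for some stage t + k, it also holds for stage t. The induction step is proven in 
lemma A.2. □ 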

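Lemma A.1 (Base case). For all $h-k-1 \le t \le h-1$, the expected cumulative future 
reward under k steps delay is equal to that under k + 1 steps delay if the same policies are 
followed from that point. That is, 

$$\forall_{h-k \le t \le h-1} \forall_{\vec{\theta}^{t}} \forall_{q^{\tau=h-t,t}} \quad Q_k^{*}(\vec{\theta}^{t}, q^{\tau=h-t,t}) = Q_{k+1}^{*}(\vec{\theta}^{t}, q^{\tau=h-t,t}), \tag{A.15}$$

and $\forall_{\vec{\theta}^{h-k-1}} \forall_{q^{\tau=k,h-k-1}} \forall_{\beta_{|k|}^{h-1}}$ 

$$Q_k(\vec{\theta}^{h-k-1}, q^{\tau=k,h-k-1}, \beta_{|k|}^{h-1}) = Q_{k+1}^{*}\big(\vec{\theta}^{h-k-1}, \langle q^{\tau=k,h-k-1} \circ \beta_{|k|}^{h-1} \rangle\big). \tag{A.16}$$

Proof. For a particular stage $t = h - \tau'$ with $h-k \le t \le h-1$ and an arbitrary $\vec{\theta}^{t}, q^{\tau=\tau',t}$, 
we can write 

$$Q_k^{*}(\vec{\theta}^{t}, q^{\tau=\tau',t}) = Q_{k+1}^{*}(\vec{\theta}^{t}, q^{\tau=\tau',t}),$$

because both are given by the evaluation of (A.5), and this evaluation involves no actions: 
Basically (A.5) has reduced to a Markov chain, and this Markov chain is the same for 
$Q_k^{*}$ and $Q_{k+1}^{*}$. We can conclude that 

$$\forall_{h-k \le t \le h-1} \forall_{\vec{\theta}^{t}} \forall_{q^{\tau=\tau',t}} \quad Q_k^{*}(\vec{\theta}^{t}, q^{\tau=\tau',t}) = Q_{k+1}^{*}(\vec{\theta}^{t}, q^{\tau=\tau',t}).$$

Now we will prove (A.16). The left side of (A.16) is given by application of (A.3) 

$$Q_k(\vec{\theta}^{h-k-1}, q^{\tau=k,h-k-1}, \beta_{|k|}^{h-1}) = R\big(\vec{\theta}^{h-k-1}, q^{\tau=k,h-k-1}(\vec{o}_{\emptyset})\big) + \sum_{o^{h-k}} P\big(o^{h-k} \mid \vec{\theta}^{h-k-1}, q^{\tau=k,h-k-1}(\vec{o}_{\emptyset})\big)\, Q_k^{*}(\vec{\theta}^{h-k}, q^{\tau=k,h-k})$$

with $q^{\tau=k,h-k} = \langle q^{\tau=k,h-k-1} \circ \beta_{|k|}^{h-1} \rangle(o^{h-k})$. The right side is given by application of (A.5) 

$$Q_{k+1}^{*}\big(\vec{\theta}^{h-k-1}, \langle q^{\tau=k,h-k-1} \circ \beta_{|k|}^{h-1} \rangle\big) = R\big(\vec{\theta}^{h-k-1}, q^{\tau=k,h-k-1}(\vec{o}_{\emptyset})\big) + \sum_{o^{h-k}} P\big(o^{h-k} \mid \vec{\theta}^{h-k-1}, q^{\tau=k,h-k-1}(\vec{o}_{\emptyset})\big)\, Q_{k+1}^{*}(\vec{\theta}^{h-k}, q^{\tau=k,h-k})$$

with $q^{\tau=k,h-k} = \langle q^{\tau=k,h-k-1} \circ \beta_{|k|}^{h-1} \rangle(o^{h-k})$. Now, because the policies $q^{\tau=k,h-k}$ are the 
same, we get 

$$Q_k^{*}(\vec{\theta}^{h-k}, q^{\tau=k,h-k}) = Q_{k+1}^{*}(\vec{\theta}^{h-k}, q^{\tau=k,h-k})$$

and thus (A.16) holds. □ 

Lemma A.2 (Induction step). Given that 

$$\forall_{\vec{\theta}^{t'}} \forall_{q^{\tau=k,t'}} \forall_{\beta_{|k|}^{t'+k}} \quad Q_k(\vec{\theta}^{t'}, q^{\tau=k,t'}, \beta_{|k|}^{t'+k}) \ge \max_{\beta_{|k+1|}^{t'+k+1}} Q_{k+1}\big(\vec{\theta}^{t'}, \langle q^{\tau=k,t'} \circ \beta_{|k|}^{t'+k} \rangle, \beta_{|k+1|}^{t'+k+1}\big) \tag{A.17}$$

holds for $t' = t + (k + 1)$, then 

$$\forall_{\vec{\theta}^{t}} \forall_{q^{\tau=k,t}} \forall_{\beta_{|k|}^{t+k}} \quad Q_k(\vec{\theta}^{t}, q^{\tau=k,t}, \beta_{|k|}^{t+k}) \ge \max_{\beta_{|k+1|}^{t+k+1}} Q_{k+1}\big(\vec{\theta}^{t}, \langle q^{\tau=k,t} \circ \beta_{|k|}^{t+k} \rangle, \beta_{|k+1|}^{t+k+1}\big) \tag{A.18}$$

holds for stage t. 

Proof. For the k-steps delay Q-function, we can write 

$$Q_k(\vec{\theta}^{t}, q^{\tau=k,t}, \beta_{|k|}^{t+k}) = R\big(\vec{\theta}^{t}, q^{\tau=k,t}(\vec{o}_{\emptyset})\big) + \sum_{o^{t+1}} P\big(o^{t+1} \mid \vec{\theta}^{t}, q^{\tau=k,t}(\vec{o}_{\emptyset})\big) \max_{\beta_{|k|}^{t+k+1}} \Big[ K_k(\vec{\theta}^{t+1}, q^{\tau=k,t+1}) + F_k^{t+1}(\vec{\theta}^{t+1}, q^{\tau=k,t+1}, \beta_{|k|}^{t+k+1}) \Big] \tag{A.19}$$

where $q^{\tau=k,t+1} = \langle q^{\tau=k,t} \circ \beta^{t+k} \rangle(o^{t+1})$. Because $K_k$ is independent of $\beta_{|k|}^{t+k+1}$, we can regroup 
the terms to get 

$$Q_k(\vec{\theta}^{t}, q^{\tau=k,t}, \beta_{|k|}^{t+k}) = \Big[ R\big(\vec{\theta}^{t}, q^{\tau=k,t}(\vec{o}_{\emptyset})\big) + \sum_{o^{t+1}} P_o(o^{t+1})\, K_k(\vec{\theta}^{t+1}, q^{\tau=k,t+1}) \Big] + \sum_{o^{t+1}} P_o(o^{t+1}) \max_{\beta_{|k|}^{t+k+1}} F_k^{t+1}(\vec{\theta}^{t+1}, q^{\tau=k,t+1}, \beta_{|k|}^{t+k+1}). \tag{A.20}$$

In the case of (k + 1)-steps delay, we can write 

$$Q_{k+1}\big(\vec{\theta}^{t}, \langle q^{\tau=k,t} \circ \beta_{|k|}^{t+k} \rangle, \beta_{|k+1|}^{t+k+1}\big) = K_{k+1}\big(\vec{\theta}^{t}, \langle q^{\tau=k,t} \circ \beta_{|k|}^{t+k} \rangle\big) + F_{k+1}^{t}\big(\vec{\theta}^{t}, \langle q^{\tau=k,t} \circ \beta_{|k|}^{t+k} \rangle, \beta_{|k+1|}^{t+k+1}\big) \tag{A.21}$$

where, per definition (by (A.9) and (A.10)) 

$$\begin{aligned} K_{k+1}\big(\vec{\theta}^{t}, \langle q^{\tau=k,t} \circ \beta_{|k|}^{t+k} \rangle\big) &= R\big(\vec{\theta}^{t}, q^{\tau=k,t}(\vec{o}_{\emptyset})\big) + \sum_{o^{t+1}} P_o(o^{t+1})\, K^{\tau=k}(\vec{\theta}^{t+1}, q^{\tau=k,t+1}) \\ &= R\big(\vec{\theta}^{t}, q^{\tau=k,t}(\vec{o}_{\emptyset})\big) + \sum_{o^{t+1}} P_o(o^{t+1})\, K_k(\vec{\theta}^{t+1}, q^{\tau=k,t+1}), \end{aligned} \tag{A.22}$$

where $q^{\tau=k,t+1} = \langle q^{\tau=k,t} \circ \beta_{|k|}^{t+k} \rangle(o^{t+1})$. 

Equation (A.22) is equal to the first part in (A.20). Therefore, for an arbitrary $\vec{\theta}^{t}, q^{\tau=k,t}$ 
and $\beta_{|k|}^{t+k}$, we know that (A.18) holds if and only if 

$$\sum_{o^{t+1}} P_o(o^{t+1}) \max_{\beta_{|k|}^{t+k+1}} F_k^{t+1}(\vec{\theta}^{t+1}, q^{\tau=k,t+1}, \beta_{|k|}^{t+k+1}) \ge \max_{\beta_{|k+1|}^{t+k+1}} F_{k+1}^{t}\big(\vec{\theta}^{t}, \langle q^{\tau=k,t} \circ \beta_{|k|}^{t+k} \rangle, \beta_{|k+1|}^{t+k+1}\big) \tag{A.23}$$

where $q^{\tau=k,t+1} = \langle q^{\tau=k,t} \circ \beta_{|k|}^{t+k} \rangle(o^{t+1})$. When filling this out and expanding $F_{k+1}^{t}$ using 
(A.12) we get 

$$\sum_{o^{t+1}} P_o(o^{t+1}) \max_{\beta_{|k|}^{t+k+1}} F_k^{\tau=k,t+1}\big(\vec{\theta}^{t+1}, \langle q^{\tau=k,t} \circ \beta_{|k|}^{t+k} \rangle(o^{t+1}), \beta_{|k|}^{t+k+1}\big) \ge \max_{\beta_{|k+1|}^{t+k+1}} \sum_{o^{t+1}} P_o(o^{t+1}) \max_{\beta_{|k+1|}^{t+k+2}} F_{k+1}^{\tau=k,t+1}\big(\vec{\theta}^{t+1}, \langle \langle q^{\tau=k,t} \circ \beta_{|k|}^{t+k} \rangle \circ \beta_{|k+1|}^{t+k+1} \rangle(o^{t+1}), \beta_{|k+1|}^{t+k+2}\big). \tag{A.24}$$

This clearly holds if 

$$\sum_{o^{t+1}} P_o(o^{t+1}) \max_{\beta_{|k|}^{t+k+1}} F_k^{\tau=k,t+1}\big(\vec{\theta}^{t+1}, \langle q^{\tau=k,t} \circ \beta_{|k|}^{t+k} \rangle(o^{t+1}), \beta_{|k|}^{t+k+1}\big) \ge \sum_{o^{t+1}} P_o(o^{t+1}) \max_{\beta_{|k|}^{t+k+1}} \max_{\beta_{|k+1|}^{t+k+2}} F_{k+1}^{\tau=k,t+1}\big(\vec{\theta}^{t+1}, \langle \langle q^{\tau=k,t} \circ \beta_{|k|}^{t+k} \rangle(o^{t+1}) \circ \beta_{|k|}^{t+k+1} \rangle, \beta_{|k+1|}^{t+k+2}\big), \tag{A.25}$$

because the second part of (A.25) is an upper bound to the second part of (A.24). Therefore, 
the induction step is proved if we can show that 

$$\forall_{o^{t+1}} \forall_{\beta_{|k|}^{t+k+1}} \quad F_k^{\tau=k,t+1}\big(\vec{\theta}^{t+1}, \langle q^{\tau=k,t} \circ \beta_{|k|}^{t+k} \rangle(o^{t+1}), \beta_{|k|}^{t+k+1}\big) \ge \max_{\beta_{|k+1|}^{t+k+2}} F_{k+1}^{\tau=k,t+1}\big(\vec{\theta}^{t+1}, \langle \langle q^{\tau=k,t} \circ \beta_{|k|}^{t+k} \rangle(o^{t+1}) \circ \beta_{|k|}^{t+k+1} \rangle, \beta_{|k+1|}^{t+k+2}\big) \tag{A.26}$$

which through (A.13) and $q^{\tau=k,t+1} = \langle q^{\tau=k,t} \circ \beta_{|k|}^{t+k} \rangle(o^{t+1})$ transforms to 

$$\forall_{q^{\tau=k,t+1}} \forall_{\beta_{|k|}^{t+k+1}} \quad F_k^{\tau=k,t+1}(\vec{\theta}^{t+1}, q^{\tau=k,t+1}, \beta_{|k|}^{t+k+1}) \ge \max_{\beta_{|k+1|}^{t+k+2}} F_{k+1}^{\tau=k,t+1}\big(\vec{\theta}^{t+1}, \langle q^{\tau=k,t+1} \circ \beta_{|k|}^{t+k+1} \rangle, \beta_{|k+1|}^{t+k+2}\big). \tag{A.27}$$

Now, we apply (A.11) to the induction hypothesis (A.17), which yields 

$$\forall_{\vec{\theta}^{t'}} \forall_{q^{\tau=k,t'}} \forall_{\beta_{|k|}^{t'+k}} \quad F_k^{\tau=0,t'}(\vec{\theta}^{t'}, q^{\tau=k,t'}, \beta_{|k|}^{t'+k}) \ge \max_{\beta_{|k+1|}^{t'+k+1}} F_{k+1}^{\tau=0,t'}\big(\vec{\theta}^{t'}, \langle q^{\tau=k,t'} \circ \beta_{|k|}^{t'+k} \rangle, \beta_{|k+1|}^{t'+k+1}\big). \tag{A.28}$$

Application of lemma A.4 to this transformed induction hypothesis asserts (A.27) and 
thereby proves the lemma. □ 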
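
The step from (A.24) to (A.25) rests on a general fact: committing to one choice before the observation is revealed can never beat choosing separately for each observation. The snippet below is a numeric illustration of that fact only. The probabilities `p`, the table `F`, and the labels `b0`, `b1` are hypothetical values chosen for the example and do not come from the paper.

```python
# Hypothetical observation probabilities, standing in for P_o(o).
p = {'x': 0.5, 'y': 0.5}

# Hypothetical returns F[o][b] for each observation o and candidate choice b.
F = {
    'x': {'b0': 2.0, 'b1': 0.0},
    'y': {'b0': 0.0, 'b1': 3.0},
}
choices = ('b0', 'b1')

# One shared choice for all observations: the max sits outside the sum.
max_outside = max(sum(p[o] * F[o][b] for o in p) for b in choices)

# A separate choice per observation: the max sits inside the sum.
max_inside = sum(p[o] * max(F[o][b] for b in choices) for o in p)

print(max_outside, max_inside)
```

Here the shared choice achieves 1.5 while the per-observation choice achieves 2.5, so the second quantity bounds the first from above.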

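Auxiliary Lemmas. 

Lemma A.3. If, at stage t, the 'in i-steps' expected return for a k-steps delayed system is 
higher than a (k + 1)-steps delayed system, then at t − 1 the 'in (i + 1)-steps' expected return 
for a k-steps delayed system is higher than the (k + 1)-steps delayed system. That is, if for 
a particular $q^{\tau=k,t} = \langle q^{\tau=k,t-1} \circ \beta_{|k|}^{t-1+k} \rangle(o^{t})$ 

$$F_k^{\tau=i,t}(\vec{\theta}^{t}, q^{\tau=k,t}, \beta_{|k|}^{t+k}) \ge \max_{\beta_{|k+1|}^{t+k+1}} F_{k+1}^{\tau=i,t}\big(\vec{\theta}^{t}, \langle \langle q^{\tau=k,t-1} \circ \beta_{|k|}^{t-1+k} \rangle(o^{t}) \circ \beta_{|k|}^{t+k} \rangle, \beta_{|k+1|}^{t+k+1}\big) \tag{A.29}$$

holds, then 

$$F_k^{\tau=i+1,t-1}(\vec{\theta}^{t-1}, q^{\tau=k,t-1}, \beta_{|k|}^{t-1+k}) \ge \max_{\beta_{|k+1|}^{t+k}} F_{k+1}^{\tau=i+1,t-1}\big(\vec{\theta}^{t-1}, \langle q^{\tau=k,t-1} \circ \beta_{|k|}^{t-1+k} \rangle, \beta_{|k+1|}^{t+k}\big). \tag{A.30}$$

Proof. The following derivation 

$$\begin{aligned} F_k^{\tau=i+1,t-1}(\vec{\theta}^{t-1}, q^{\tau=k,t-1}, \beta_{|k|}^{t-1+k}) &= \sum_{o^{t}} P_o(o^{t}) \max_{\beta_{|k|}^{t+k}} F_k^{\tau=i,t}(\vec{\theta}^{t}, q^{\tau=k,t}, \beta_{|k|}^{t+k}) \\ &\ge \sum_{o^{t}} P_o(o^{t}) \max_{\beta_{|k|}^{t+k}} \max_{\beta_{|k+1|}^{t+k+1}} F_{k+1}^{\tau=i,t}\big(\vec{\theta}^{t}, \langle q^{\tau=k,t} \circ \beta_{|k|}^{t+k} \rangle, \beta_{|k+1|}^{t+k+1}\big) \\ &\ge \max_{\beta_{|k+1|}^{t+k}} \sum_{o^{t}} P_o(o^{t}) \max_{\beta_{|k+1|}^{t+k+1}} F_{k+1}^{\tau=i,t}\big(\vec{\theta}^{t}, \langle \langle q^{\tau=k,t-1} \circ \beta_{|k|}^{t-1+k} \rangle \circ \beta_{|k+1|}^{t+k} \rangle(o^{t}), \beta_{|k+1|}^{t+k+1}\big) \\ &= \max_{\beta_{|k+1|}^{t+k}} F_{k+1}^{\tau=i+1,t-1}\big(\vec{\theta}^{t-1}, \langle q^{\tau=k,t-1} \circ \beta_{|k|}^{t-1+k} \rangle, \beta_{|k+1|}^{t+k}\big) \end{aligned}$$

proves the lemma. □ 

Lemma A.4. If, for some stage t 

$$\forall_{\vec{\theta}^{t}} \forall_{q^{\tau=k,t}} \forall_{\beta_{|k|}^{t+k}} \quad F_k^{\tau=0,t}(\vec{\theta}^{t}, q^{\tau=k,t}, \beta_{|k|}^{t+k}) \ge \max_{\beta_{|k+1|}^{t+k+1}} F_{k+1}^{\tau=0,t}\big(\vec{\theta}^{t}, \langle q^{\tau=k,t} \circ \beta_{|k|}^{t+k} \rangle, \beta_{|k+1|}^{t+k+1}\big) \tag{A.31}$$

holds, then 

$$\forall_{i} \forall_{\vec{\theta}^{t-i}} \forall_{q^{\tau=k,t-i}} \forall_{\beta_{|k|}^{t-i+k}} \quad F_k^{\tau=i,t-i}(\vec{\theta}^{t-i}, q^{\tau=k,t-i}, \beta_{|k|}^{t-i+k}) \ge \max_{\beta_{|k+1|}^{t-i+k+1}} F_{k+1}^{\tau=i,t-i}\big(\vec{\theta}^{t-i}, \langle q^{\tau=k,t-i} \circ \beta_{|k|}^{t-i+k} \rangle, \beta_{|k+1|}^{t-i+k+1}\big). \tag{A.32}$$

Proof. If (A.31) holds for all $\vec{\theta}^{t}, q^{\tau=k,t}, \beta_{|k|}^{t+k}$, then eq. (A.29) is satisfied for all $\vec{\theta}^{t}, q^{\tau=k,t}, 
\beta_{|k|}^{t+k}$, and lemma A.3 yields 

$$\forall_{\vec{\theta}^{t-1}} \forall_{q^{\tau=k,t-1}} \forall_{\beta_{|k|}^{t-1+k}} \quad F_k^{\tau=1,t-1}(\vec{\theta}^{t-1}, q^{\tau=k,t-1}, \beta_{|k|}^{t-1+k}) \ge \max_{\beta_{|k+1|}^{t+k}} F_{k+1}^{\tau=1,t-1}\big(\vec{\theta}^{t-1}, \langle q^{\tau=k,t-1} \circ \beta_{|k|}^{t-1+k} \rangle, \beta_{|k+1|}^{t+k}\big). \tag{A.33}$$

At this point we can apply the lemma again, etc. The i-th application of the lemma yields 
(A.32). □ 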